Framework for Real-Time Analytics
Hakim.co.in
Real Time Analytics
Index
Introduction
Evolving BI and Analytics for Big Data
Impact on Traditional BI Databases
Challenges
MongoDB with Hadoop
Case Studies
Current Scenario
Introduction
• Analytics falls along a spectrum. On one end of the spectrum sit batch analytical applications, which are used for complex, long-running analyses. They tend to have slower response times (up to minutes, hours, or days) and lower availability requirements. Examples of batch analytics include Hadoop-based workloads.
• On the other end of the spectrum sit real-time analytical applications, which provide lighter-weight analytics very quickly. Latency is low (sub-second) and availability requirements are high (e.g., 99.99%). MongoDB is typically used for real-time analytics.
Business Intelligence (BI) and analytics provide an essential set of technologies and processes that organizations have relied upon for many years to guide strategic business decisions.
Introduction
Traditional BI is characterized by:
1. Predictable Frequency. Data is extracted from source systems at regular
intervals - typically measured in days, months and quarters
2. Static Sources. Data is sourced from controlled, internal systems supporting
established and well-defined back-office processes
3. Fixed Models. Data structures are known and modeled in advance of
analysis. This enables the development of a single schema to accommodate
data from all of the source systems, but adds significant time to the upfront
design
4. Defined Queries. Questions to be asked of the data (i.e., the reporting
queries) are pre-defined. If not all of the query requirements are known
upfront, or requirements change, then the schema has to be modified to
accommodate changes
5. Slow-changing requirements. Rigorous change-control is enforced before
the introduction of new data sources or reporting requirements
6. Limited users. The consumers of BI reports are typically business managers
Evolving BI and Analytics for Big Data
Higher Uptime Requirements
The immediacy of real-time analytics
accessed from multiple fixed and mobile
devices places additional demands on the
continuous availability of BI systems.
Batch-based systems can often tolerate a
certain level of downtime, for example for
scheduled maintenance. Online systems on
the other hand need to maintain operations
during both failures and planned upgrades.
The Need for Speed & Scale
Time to value is everything. For example,
having access to real-time customer
sentiment or logistics tracking is of little
benefit unless the data can be analyzed
and reported in real-time. As a
consequence, the frequency of data
acquisition, integration and analysis must
increase from days to minutes or less,
placing significant operational overhead on
BI systems.
Agile Analytics and Reporting
With such a diversity of new data sources, business analysts cannot know all of the questions they need to ask in advance. An essential requirement, therefore, is that the data can be stored before knowing how it will be processed and queried.
The Changing Face of Data
Data generated by workloads such as social, mobile, sensor and logging applications is much more complex and variably structured than traditional transaction data from back-office systems such as ERP, CRM, PoS (Point of Sale) and Accounts Receivable.
Taking BI to the Cloud
The drive to embrace cloud computing to reduce costs and improve agility means BI components that have traditionally relied on databases deployed on monolithic, scale-up systems must be re-designed for the elastic scale-out, service-oriented architectures of the cloud.
Impact on Traditional BI Databases
The relational databases underpinning many of today’s traditional BI platforms are not well suited to the requirements of big
data:
• Semi-structured and unstructured data typical in mobile, social and sensor-driven applications cannot be efficiently
represented as rows and columns in a relational database table
• Rapid evolution of database schema to support new data sources and rapidly changing data structures is not
possible in relational databases, which rely on costly ALTER TABLE operations to add or modify table attributes
• Performance overhead of JOINs and transaction semantics prevents relational databases from keeping pace with the
ingestion of high-velocity data sources
• Quickly growing data volumes require scaling databases out across commodity hardware, rather than the scale-up
approach typical of most relational databases
Relational databases’ inability to
handle the speed, size and diversity
of rapidly changing data generated
by modern applications is already
driving the enterprise adoption of
NoSQL and Big Data technologies in
both operational and analytical
roles.
The Purpose
• Flume feeding Hadoop for batch processing, which keeps the data relevant time-wise; it can serve near-real-time use cases because the data stays fresh, arriving only several minutes to even a second late.
• A server-side Flume engine used to make decisions regarding the current state of affairs.
• Decisions made based on whatever data reflects a customer's current condition, without the full history in their user profiles; incorporating that history would enable a much more informed decision.
• State-of-the-art auto-updating charts and report creation with a dashboard UI.
Increase the scalability and performance of organizations using a real-time analytics platform, with a focus on storing, processing and analyzing exponentially growing data using big data technologies.
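A minimal sketch of that decision path follows, assuming a local MongoDB deployment; the database, collection and field names (analytics, events, user_state, user_id) and the decision threshold are all illustrative, not the project's actual code:

```python
# Minimal sketch of the real-time decision path described above.
# All names and the threshold are illustrative assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical deployment
db = client["analytics"]

def handle_event(event: dict) -> str:
    """Store one incoming event and decide from the current state only."""
    event["received_at"] = datetime.now(timezone.utc)
    db.events.insert_one(event)  # raw event stored as-is, whatever its shape

    # Maintain a rolling "current condition" document per customer.
    db.user_state.update_one(
        {"_id": event["user_id"]},
        {
            "$set": {"last_event": event["type"], "updated_at": event["received_at"]},
            "$inc": {"event_count": 1},
        },
        upsert=True,
    )

    # The decision uses only the freshest state, not the full profile history.
    state = db.user_state.find_one({"_id": event["user_id"]})
    return "escalate" if state["event_count"] > 100 else "ok"

print(handle_event({"user_id": "u42", "type": "page_view"}))
```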
Challenges
1. Getting data metrics to the right people
Often, social media is treated like the ugly stepchild within the marketing department and real-time
social media analytics are either absent or ignored.
2. Visualization
Visualizing real-time social media analytics is another key element involved in developing insights
that matter.
Simply displaying values graphically helps in making the kinds of fast interpretations necessary for
making decisions with real-time data, but adding more complex algorithms and using models
provides deeper insights, especially when visualized.
3. Unstructured data is challenging
Unlike the survey data firms are used to dealing with, most data (IBM estimates 80%) is unstructured, meaning it consists of words rather than numbers. And text analytics lags seriously behind numeric analysis.
4. Increasing signal to noise
Social media data is inherently noisy. Reducing the noise enough to even detect a signal is challenging, especially in real time. Sure, given enough time, new analytics tools can ferret out the few meaningful signals.
Top 10 Priorities
1 Enable new fast-paced business practices
2 Don’t expect the new stuff to replace the old stuff
3 Do not assume that all the data needs to be in real time, all the time
4 Correlate real-time data with data from other sources and latencies
5 Start with a proof of value with measurable outcomes
6 As a safe starter project, accelerate successful latent processes into near real time
7 Think about operationalizing analytics
8 Think about the skills you need
9 Examine application business rules to ensure they are ready for real-time data flows
10 Evaluate technology platforms and expertise for availability and reliability
Challenges
Real-Time Analytics is Hard
Can’t Stay Ahead. You need to account for
many types of data, including unstructured
and semi-structured data. And new sources
present themselves unpredictably.
Relational databases aren’t capable of
handling this, which leaves you hamstrung.
Can’t Scale. You need to analyze terabytes
or petabytes of data. You need sub-second
response times. That’s a lot more than a
single server can handle. Relational
databases weren’t designed for this.
Batch. Batch processes are the right
approach for some jobs. But in many cases,
you need to analyze rapidly changing,
multi-structured data in real time. You
don’t have the luxury of lengthy ETL
processes to cleanse data for later.
MongoDB Makes it Easy
Do the Impossible. MongoDB can incorporate any
kind of data – any structure, any format, any
source – no matter how often it changes. Your
analytical engines can be comprehensive and real-time.
Scale Big. MongoDB is built to scale out on
commodity hardware, in your data center or in the
cloud. And without complex hardware or extra
software. This shouldn’t be hard, and with
MongoDB, it isn’t.
Real Time. MongoDB can analyze data of any
structure directly within the database, giving you
results in real time, and without expensive data
warehouse loads.
Why Other Databases Fall Short and MongoDB Doesn't
Most databases make you choose between a flexible data model, low latency at scale, and powerful access. But increasingly you need all three at the same time.
• Rigid Schemas. You should be able to analyze unstructured, semi-structured, and
polymorphic data. And it should be easy to add new data. But this data doesn’t
belong in relational rows and columns. Plus, relational schemas are hard to
change incrementally, especially without impacting performance or taking the
database offline.
• Scaling Problems. Relational databases were designed for single-server
configurations, not for horizontal scale-out. They were meant to serve 100s of ops
per second, not 100,000s of ops per second. Even with a lot of engineering hours,
custom sharding layers, and caches, scaling an RDBMS is hard at best and
impossible at worst.
• Takes Too Long. Analyzing data in real time requires a break from the familiar
ETL and data warehouse approach. You don’t have time for lengthy load
schedules, or to build new query models. You need to run aggregation queries
against variably structured data. And you should be able to do so in place, in real
time.
Organizations are using MongoDB for analytics because it
lets them store any kind of data, analyze it in real time,
and change the schema as they go.
New Data. MongoDB’s document model enables you to store and process data
of any structure: events, time series data, geospatial coordinates, text and
binary data, and anything else. You can adapt the structure of a document’s
schema just by adding new fields, making it simple to bring in new data as it
becomes available.
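For instance, a minimal PyMongo sketch of that flexibility; the collection and every field name here are invented for illustration:

```python
# Sketch: documents of different shapes coexist in one collection, and new
# fields arrive without any migration. All names are illustrative.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["analytics"]

db.events.insert_many([
    {"type": "checkin", "user": "u1",
     "loc": {"type": "Point", "coordinates": [-87.63, 41.88]}},    # geospatial
    {"type": "log", "level": "ERROR", "msg": "timeout",
     "ts": "2014-05-01T12:00:00Z"},                                # log event
    {"type": "tick", "symbol": "ACME", "bid": 10.01, "ask": 10.03} # time series
])

# A new data source later is just documents with new fields; no ALTER TABLE.
db.events.insert_one({"type": "sensor", "device_id": "d7", "temp_c": 71.5})
```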
Horizontal Scalability. MongoDB’s automatic sharding distributes data across
fleets of commodity servers, with complete application transparency. With
multiple options for scaling – including range-based, hash-based and location-aware sharding – MongoDB can support thousands of nodes, petabytes of
data, and hundreds of thousands of ops per second without requiring you to
build custom partitioning and caching layers.
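Operationally, enabling this can be as small as the sketch below, run against a sharded cluster's mongos router; the host address and the analytics.events namespace are assumptions for illustration:

```python
# Sketch: turning on hash-based sharding for a collection via MongoDB's
# admin commands. Host and namespace names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical mongos router

client.admin.command("enableSharding", "analytics")
client.admin.command(
    "shardCollection",
    "analytics.events",
    key={"user_id": "hashed"},  # hashed key spreads writes evenly across shards
)
```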
Powerful Analytics, In Place, In Real Time. With rich index and query
support – including secondary, geospatial and text search indexes – as well as
the aggregation framework and native MapReduce, MongoDB can run complex
ad-hoc analytics and reporting in place.
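As one hedged example of such in-place analytics, an aggregation pipeline that counts events per type and hour; the collection and field names are again invented:

```python
# Sketch: ad hoc reporting computed inside the database with the
# aggregation framework. Collection/field names are illustrative.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["analytics"]

pipeline = [
    {"$match": {"ts": {"$gte": "2014-05-01"}}},            # recent events only
    {"$group": {
        "_id": {"type": "$type", "hour": {"$substr": ["$ts", 0, 13]}},
        "count": {"$sum": 1},                               # events per type/hour
    }},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]

for row in db.events.aggregate(pipeline):
    print(row["_id"], row["count"])
```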
MongoDB with Hadoop
The following table provides examples of customers using MongoDB together with Hadoop to power big data applications. Whether improving customer service, supporting cross-sell and upsell, enhancing business efficiency or reducing risk, MongoDB and Hadoop provide the foundation to operationalize big data.

Company | MongoDB | Hadoop
eBay | User data and metadata management for product catalog | User analysis for personalized search & recommendations
Orbitz | Management of hotel data and pricing | Hotel segmentation to support building search facets
Pearson | Student identity and access control; content management of course materials | Student analytics to create adaptive learning programs
Foursquare | User data, check-ins, reviews, venue content management | User analysis, segmentation and personalization
Tier 1 Investment Bank | Tick data, quants analysis, reference data distribution | Risk modeling, security and fraud detection
Industrial Machinery Manufacturer | Storage and real-time analytics of sensor data collected from connected vehicles | Preventive maintenance programs for fleet optimization; in-field monitoring of vehicle components for design enhancements
SFR | Customer service applications accessed via online portals and call centers | Analysis of customer usage, devices & pricing to optimize plans
Future Trends in Real-Time Data, BI, and Analytics
Data types handled in real time today. Numerous TDWI surveys have shown that structured data (which includes relational data) is by far the most common class of data handled for BI and analytic purposes, as well as for many operational and transactional ones. It’s no surprise that structured data bubbled to the top of the survey results (Figure 16 in the TDWI report). Other data types and sources commonly handled in real time today include application logs (33%), event data (26%), semi-structured data (26%), and hierarchical and raw data (24% each).
Data types to be handled in real time within three years. Looking ahead, a number of data
types are poised for greater real-time usage. Some are in limited use today but will
experience aggressive adoption within three years, namely social media data (38%), Web logs
and clickstreams (34%), and unstructured data (34%). Others are handled in real time today
and will become even more so, namely event (36%), semi-structured (33%), structured (31%),
and hierarchical (30%) data.
Case Studies
MongoDB Integration with BI and Analytics Tools
To make online big data actionable through dashboards, reports, visualizations and integration with other data sources, it must be accessible to established BI and analytics tools. MongoDB offers integration with more of the leading BI tools than any other NoSQL or online big data technology, including: Actuate, Alteryx, Informatica, Jaspersoft, Logi Analytics, MicroStrategy, Pentaho, Qliktech, and SAP Lumira.
WindyGrid
One person, one laptop, and MongoDB’s technology jumpstarted a project that, with
other people joining in, went from prototype to one of the nation’s pioneering projects
to analyze and act on municipal data in real time. In just four months.
WindyGrid put Chicago on the path of revolutionizing how it operates not by replacing
the administrative systems already in place, but by using MongoDB to bring that data
together into a new application. With MongoDB’s flexible data model, WindyGrid doesn’t
have to go back and redo the schema for each new piece of data. Instead, it can evolve
schemas in real time, which is crucial as WindyGrid expands and adds predictive
analytics, growing by millions of pieces of structured and unstructured data each day.
Crittercism Is a Mobile Pioneer
Crittercism doesn’t just monitor apps or gather information. Using MongoDB’s powerful built-in query functions, it analyzes avalanches of unstructured and non-uniform data in real time. It recognizes patterns, identifies trends, and diagnoses problems. That means that Crittercism’s customers immediately understand the root cause of problems and the impact they’re having on business. So they know how to prioritize and correct the problems they’re facing and improve performance.
The kind of real time analysis that Crittercism provides customers would also be impossible with traditional
databases. Crittercism is using MongoDB’s powerful query functions to analyze the broad variety of data it
collects, in real time, within the database. A more traditional data warehouse approach, with ETLs and long
loading times, can’t match this type of speed.
At the same time, MongoDB lets Crittercism efficiently handle the tons of data it’s collecting. During the past two
years, the number of requests that Crittercism gathers and analyzes has jumped from 700 to 45,000 per second.
Relational databases have a hard time scaling to meet these kinds of demands, typically requiring expensive add-on software, or additional layers of proprietary code, to keep up. With MongoDB, horizontal scalability across multiple data centers is a native function.
McAfee - Global Cybersecurity
McAfee’s Global Threat Intelligence (GTI) service analyzes cyberthreats from all angles, identifying threat relationships such as malware used in network intrusions, websites hosting malware, botnet associations, and more. Threat information is extremely time sensitive; knowing about a threat from weeks ago is useless.
To provide up-to-date, comprehensive threat information, GTI needs to quickly process terabytes of different data types (such as IP addresses and domains) into meaningful relationships: e.g., is this website good or bad? What other sites have been interacting with it? The success of the cloud-based system also depends on a bidirectional data flow: GTI gathers data from millions of client sensors and provides real-time intelligence back to these end products, at a rate of 100 billion queries per month.
McAfee was unable to address these needs and effectively scale out to millions of records with its existing solutions. For example, the HBase/Hadoop setup made it difficult to run interesting, complex queries, and the team experienced bugs with the Java garbage collector running out of memory. Another issue was with sharding and syncing: Lucene was able to index in interesting ways, but required too much customization. The team compensated for all the rebuilding and redeploying of Katta shards with “the usual scripting duct tape,” but what they really needed was a solution that could seamlessly handle the sharding and updating on its own. McAfee selected MongoDB, which had excellent documentation and a growing community that was “on fire.”
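A minimal sketch of the kind of reputation-and-relationship lookup described above; the gti database, the domains and relationships collections, and the score threshold are all hypothetical, not McAfee's actual schema:

```python
# Sketch: answering "is this site bad, and who talks to it?" from documents.
# Database, collections, fields and the threshold are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["gti"]

def related_sites(domain: str) -> list:
    """Which other sites have been interacting with this domain?"""
    return db.relationships.distinct("target", {"source": domain})

def reputation(domain: str) -> str:
    """Rough good/bad verdict from a stored per-domain risk score."""
    doc = db.domains.find_one({"_id": domain}, {"score": 1})
    return "bad" if doc and doc.get("score", 0) > 50 else "good/unknown"

print(reputation("example.com"), related_sites("example.com"))
```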
Power Journalism
BuzzFeed, the social news and entertainment company, relies on MongoDB to analyze all performance data for its content across the social web. A core part of BuzzFeed’s publishing platform, MongoDB exposes metrics to editors and writers in real time, to help them understand how their content is performing and to optimize for the social web. The company has been using MongoDB since 2010. Here’s why.
1. Analytics provide more insight, more quickly. BuzzFeed relies on MongoDB for its strategic analytics platform. With apps and dashboards built on MongoDB, BuzzFeed can pinpoint when content is viewed and how it is shared. With this approach, it is able to quickly gain insight into how its content performs, nimbly optimize the user experience for posts that are performing best, and deliver critical feedback to its writers and editors.
2. BuzzFeed is data-driven. At BuzzFeed, data drives decision-making and powers the company. MongoDB enables BuzzFeed to effectively analyze, track and expose a range of metrics to writers and employees. This includes the number of clicks; how often and where posts are being shared; which views on different social media properties lead to the most shares; and how views differ across mobile and desktop (see the sketch after this list).
3. Successful web journalism demands scale. BuzzFeed processes large volumes of data, and this is increasing each year as the site’s traffic continues to grow. Originally built on a relational data store, BuzzFeed decided to use MongoDB, a more scalable solution, to collect and track the data it needs with richer functionality than a standard key-value store.
4. Editors gain an edge with access to data in minutes. Fast, easy access to data is critical to helping editors determine what content will be most shareable in the social media world. With MongoDB, BuzzFeed is able to expose performance data shortly after publication, enabling editors to quickly respond by tweaking headlines and determining the best way to promote a post.
5. Setting the infrastructure for new applications. As BuzzFeed continues its efforts to leverage stats and optimization, MongoDB will feature prominently.
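A hedged sketch of the real-time counters behind metrics like those in point 2 above; the buzz database, the post_stats collection, and the field layout are illustrative assumptions, not BuzzFeed's actual schema:

```python
# Sketch: atomic per-post counters bumped the moment a click or share
# happens, readable immediately. All names are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["buzz"]

def record(post_id: str, metric: str, network: str = "web") -> None:
    """Atomically increment a per-network and a total counter for one post."""
    db.post_stats.update_one(
        {"_id": post_id},
        {"$inc": {f"{metric}.{network}": 1, f"{metric}.total": 1}},
        upsert=True,
    )

record("post-123", "clicks")
record("post-123", "shares", network="facebook")
print(db.post_stats.find_one({"_id": "post-123"}))  # live metrics document
```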
Current Scenario
Current Offerings

Real Time Analytics
