Stream Meets Batch for Smarter Analytics - Impetus White Paper


For Impetus' White Papers archive, visit: http://www.impetus.com/whitepaper



Stream Meets Batch for Smarter Analytics

White Paper

Abstract

This white paper focuses on dealing with Big Data problems in real time. It discusses how the traditional batch paradigm and the real-time paradigm can work together to deliver smarter, quicker and better insights on large volumes of data. It also covers additions to existing solutions that address low-latency use cases, and it guides you in picking the right strategy and the right technology stack for real-time Big Data analytics problems.

Impetus Technologies, Inc.
www.impetus.com
Table of Contents

Introduction
The Archive Data Analytics Platform
  Emerging Use Cases of Batch Processing
  Downsides of Batch Processing
Live Data Analytics Platform
The Stream Processing System
  Benefits of Stream Processing
  Interesting Use Cases of Live Data Analytics
Integration of Archive Data and Live Data Analysis
Smarter Data Ingestion
Adaptive Analysis
Case Study: Auto Categorize News Articles
Summary

Introduction

The evolution of digital media and technologies has led to exponential growth in the volume of data produced by mankind. The data has grown to exabytes, and is expanding daily. As digital technologies touch every aspect of our lives, data is generated from posts on social media sites, e-mails, digital pictures, online videos, sensor data for climate information, GPS signals, cell phone data, browsing data, transactional data of online shoppers, and more. We categorize such data as Big Data.

Enterprises are identifying smarter ways to extract valuable information out of this data.
This valuable information can be used to predict market trends, optimize business processes, create effective campaigns and improve user services. Analysis of the data can fall into two broad classes – real-time and historic. Together, both kinds of analysis provide a 360-degree view of valuable information.
A significant amount of work has been done in the area of historic or batch analytics, with the evolution of solutions built over the Hadoop platform or similar platforms like R Analytics and HPCC. However, enterprises are lagging behind in the area of real-time analysis of Big Data, and even more in the combination of the two. The real challenge is dealing with historic information to find insights, and finding smarter ways to effectively use those insights with real-time data.

This paper primarily focuses on Big Data real-time processing strategies that enable existing platforms to handle low-latency use cases. This empowers businesses to gain quick insight and, in turn, maximize Return on Investment (ROI).

The Archive Data Analytics Platform

Batch or archive data processing is the most widely used approach for analyzing big volumes of data. In batch processing, the data is aggregated into a single entity called a batch or a job. The biggest caveat in batching is that it will not give you a partial result of the analysis: for results, you have to wait until the batch processing is done. Batch analysis is best suited for deeper analysis that requires a full view of the data. Consider an e-commerce web site, where the requirement is to recommend to users the products of their taste, to maximize sales.

Emerging Use Cases of Batch Processing
• Deeper analytics
• Classification of data
• Clustering of data
• Recommendations based on user tastes

Downsides of Batch Processing
• High-latency results
• Bigger hardware requirements
• Limited ad-hoc capabilities

Live Data Analytics Platform

Time is the key. Analytics solutions for domains like defense, credit card fraud detection, intelligence, law enforcement, online trading and security need to
quickly analyze, identify and react to patterns of threats by continuously processing the enormous amounts of data generated from network logs, e-mails, social media feeds, sensor data, web feeds and many other sources. For such applications, timely response is the key to their business; high-latency information is of no use.

Enterprises need a revolutionary upgrade in their capabilities to extract, transform, analyze and quickly respond to the huge volume of data coming in real time. Today, many enterprises are struggling to manage and analyze massive and growing volumes of data in real time.

Lately, a few technologies and tools have emerged to meet the challenges of analyzing high volumes of data in real time or near real time. This section discusses a few of the existing approaches, along with their downsides:

1. Relational database management systems: RDBMSs have been available for years for OLTP as well as data-warehouse classes of applications. However, they do not scale or perform well for high-volume streaming data because of indexing limitations.

2. Main-memory databases: These modified versions of DBMSs target the same set of functionalities as traditional DBMSs but with higher throughput, by storing data in main memory instead of physical storage. Like traditional systems, they also fall short when it comes to Big Data requirements.

3. Rule engines: Sales and marketing has 'repurposed' these to deal with Big Data real-time applications. Their downside is a lack of suitable storage systems, and hence the need for separate infrastructure for persistence of data.
The Stream Processing System

The stream processing system is a completely new paradigm, well suited for handling continuous data. It offers high scalability, performance and flexibility over other traditional approaches. Conventional systems run continuous queries over stored static data, whereas a stream processing system runs static queries over continuous, unbounded data. A stream processing system is to continuous unbounded data what a DBMS is to structured stored data.

The stream processing platform consists of three major components: input connectors, output connectors and ETL components. Various types of incoming data from multiple sources are pulled into the platform using input connectors. In the next stage, the data is cleansed, filtered, transformed, clustered, classified or correlated, and the resulting information is used for notifications, reporting and analyses.
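The three-part pipeline described above can be sketched in a few lines. This is a minimal illustration only, with hypothetical names and in-memory data standing in for real connectors; each Python generator stage passes events along as they arrive, mirroring how a record flows from an input connector through ETL to an output connector.

```python
def input_connector(events):
    """Pull raw events from a source (here, an in-memory list stands in for a feed)."""
    for event in events:
        yield event

def cleanse(stream):
    """ETL stage: drop malformed events and normalize the surviving fields."""
    for event in stream:
        if "user" in event and "amount" in event:
            yield {"user": event["user"].strip().lower(),
                   "amount": float(event["amount"])}

def output_connector(stream):
    """Deliver processed events for notification or reporting."""
    return list(stream)

raw = [{"user": " Alice ", "amount": "42.5"},
       {"bad": "record"},
       {"user": "Bob", "amount": "7"}]
processed = output_connector(cleanse(input_connector(raw)))
print(processed)  # the malformed record is filtered out
```

Because each stage is a generator, nothing waits for the whole data set: a record entering the input connector is cleansed and emitted immediately, which is exactly the contrast with the batch model described earlier.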
Benefits of stream processing
• Online accumulation
• Real-time analytics
• Live BI capabilities
• Smart ingestion into the data warehouse (details in the next section)

Interesting Use Cases of Live Data Analytics

Fraud detection – Analysis of millions of real-time credit card transactions to detect and prevent fraud using predictive algorithms. Text in insurance claim documents can also be analyzed to identify probable fraud cases.

Patient health monitoring – An analytical solution can capture streams of data coming from medical equipment that monitors a patient's heart rate, blood pressure, sugar levels and temperature, and predict whether an infection or complication may occur.

Omni-channel retail – Data from various independent sources can be analyzed to enhance shoppers' experience through product recommendations, customized campaigns, and location-based offerings.
Integration of Archive Data and Live Data Analysis

Both archive data analysis and live data analysis can handle their own classes of use cases, and they complement each other. At times, enterprises require close integration of both platforms to get a full 360-degree view of the information. This section focuses on the benefits of integrating these two classes of platforms.

Smarter Data Ingestion

Recently, an interesting trend has been observed in Big Data repositories. A lot of the data stored in a data warehouse is of very little or no business use and will never appear in business reports. This has been called the 'Big Data fetish' problem. To overcome it, it is essential to identify what is worth storing, and to store only what is relevant to the business.

Streaming systems can be used to address the Big Data fetish problem. Data coming from various data sources can be cleansed, extracted, transformed, filtered and normalized in the streaming system. The processed data can then be persisted in data warehouses for deeper analytics. This approach reduces the overall cost of data storage by a significant amount.

An example can be seen in e-mail or SMS processing use cases such as Lawyers.com. In this use case, significant storage optimization can be achieved by identifying spam and corrupted messages before dumping them into the data store. This can be achieved using streams.
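A minimal sketch of this smart-ingestion idea follows. The spam markers, message shape and list-backed "warehouse" are all assumptions for illustration; the point is only that the stream-side filter runs before the warehouse write, so junk never consumes storage.

```python
SPAM_MARKERS = ("win a prize", "free money")  # assumed markers, for illustration

def is_corrupted(msg):
    """Treat a message with no body as corrupted."""
    return not msg.get("body")

def is_spam(msg):
    body = msg.get("body", "").lower()
    return any(marker in body for marker in SPAM_MARKERS)

def ingest(stream, warehouse):
    """Persist only clean, relevant messages; count what was dropped."""
    kept = dropped = 0
    for msg in stream:
        if is_corrupted(msg) or is_spam(msg):
            dropped += 1  # never reaches storage
        else:
            warehouse.append(msg)  # stand-in for a real warehouse write
            kept += 1
    return kept, dropped

warehouse = []
messages = [{"body": "Claim meeting at 3pm"},
            {"body": "WIN A PRIZE now!!!"},
            {"body": ""}]
result = ingest(messages, warehouse)
print(result)  # two of the three messages are filtered before storage
```

In a real deployment the filter would sit in the streaming platform's ETL stage and the append would be a warehouse load, but the cost saving comes from the same decision point.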
Adaptive Analysis

Live data analytics platforms and archive data analytics platforms can exchange data between them. They can also be used smartly to exchange or share intelligence, which helps improve the effectiveness, accuracy and quality of analysis. Exchange of intelligence can be achieved in two ways:

1. Archive-to-live exchange: Deeper analytics algorithms, such as recommendation, classification, clustering, statistical and pattern-finding algorithms, are applied over huge volumes of data accumulated over long periods of time; for instance, generating a classification model over historic e-mails, or finding item-similarity models for recommendations. The generated model can later be utilized by a corresponding component in the stream to derive quick, real-time insight over continuous data. For instance, an incoming e-mail or document stream can be classified or categorized in real time. In this scenario, deeper analysis helps live streams in decision making.

2. Live-to-archive exchange: The stream validates unbounded incoming data using models generated by the batch processing platform. If conflicts go beyond a threshold level, the stream processing platform can signal to the batch platform that it is time to update or rebuild the model. For instance, if we are categorizing incoming documents against the Wikipedia categorization model and the percentage of documents falling into the default or unidentified category goes beyond a threshold level, then the stream can signal the batch processing platform to rebuild the categorization model using a new set of documents. In this scenario, the archive platform is assisted by the live platform, for a better quality of analytics.
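The live-to-archive signal can be sketched as follows. The keyword model, the 0.4 threshold and the toy classifier are all assumptions for illustration; the essential mechanism is that the stream tracks its own "unknown" rate against a batch-built model and raises a rebuild flag when that rate crosses the threshold.

```python
# Stand-in for a model the batch platform generated over archived documents.
BATCH_MODEL = {"sports": {"match", "goal"}, "finance": {"stock", "market"}}
THRESHOLD = 0.4  # assumed: fraction of unknowns that triggers a rebuild signal

def classify(doc, model):
    """Assign the first category whose keywords overlap the document."""
    words = set(doc.lower().split())
    for category, keywords in model.items():
        if words & keywords:
            return category
    return "unknown"

def process_stream(docs, model):
    """Classify each document and decide whether to signal a model rebuild."""
    labels = []
    unknown = 0
    for doc in docs:
        label = classify(doc, model)
        unknown += label == "unknown"
        labels.append(label)
    rebuild = unknown / len(docs) > THRESHOLD
    return labels, rebuild

labels, rebuild = process_stream(
    ["great goal in the match", "stock market dips", "quantum entanglement"],
    BATCH_MODEL)
print(labels, rebuild)  # one unknown out of three stays under the threshold
```

Here only one of three documents falls into the unknown category, so no rebuild is signalled; had a second one missed, the flag would flip and the batch platform would be asked for a fresh model.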
Case Study: Auto Categorize News Articles

This section describes how the integration concepts explained above can be applied to real-world use cases. Consider the example of auto-categorization of news article streams or feeds coming from different data sources.

[Flow diagram: incoming feeds are cleansed and parsed, pushed to the batch platform for deeper analytics, categorized in the stream using the batch-generated model, and a model rebuild is triggered when the default category crosses a threshold.]

Incoming news article streams and feeds are first cleansed and parsed to extract meaningful data, with garbage data thrown away. In the second stage, the extracted data is pushed into the batch processing platform for deeper analytics. At the same time, the stream processing platform categorizes the news articles using the model generated by the batch processing platform. If the percentage of articles in the default or unknown category crosses a threshold limit, the stream platform can trigger the batch platform to regenerate the model. Once the batch platform is done with model generation, the updated model is pushed to the stream platform, which resumes categorizing documents in real time.
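The stages above can be sketched end to end. Everything here is hypothetical (the feed shape, the keyword-set "model", the labeled seed data): the batch side derives a model from accumulated labeled articles, and the stream side cleanses each incoming item and categorizes it on the fly.

```python
def cleanse(feed):
    """Stream stage one: drop garbage entries, keep normalized article text."""
    return [item["text"].lower() for item in feed if item.get("text")]

def rebuild_model(labeled_articles):
    """Batch side: derive a keyword set per category from labeled articles."""
    model = {}
    for text, category in labeled_articles:
        model.setdefault(category, set()).update(text.lower().split())
    return model

# Model built by the "batch platform" from archived, labeled articles.
model = rebuild_model([("election vote parliament", "politics"),
                       ("striker scores goal", "sports")])

feed = [{"text": "Parliament vote tonight"},
        {"junk": True},
        {"text": "New goal record"}]

# Stream side: categorize each cleansed article against the current model.
categories = []
for article in cleanse(feed):
    words = set(article.split())
    category = next((c for c, kw in model.items() if words & kw), "default")
    categories.append(category)
print(categories)
```

When too many articles land in "default", the stream would call `rebuild_model` with a newer labeled set, which is the trigger described in the flow above.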
Summary

An ideal analytics platform is one that can support offline analytics as well as online, real-time analytics with equal ease. These are two completely different paradigms that not only complement each other but assist each other for effective analytics. Together, they can provide effective, quick and 360-degree insight into large data. Having this integration strategy in place empowers the platform to target almost any type of use case.

This paper has described different integration points where these paradigms can interact with each other to deliver smart, quick and complete analytics over Big Data.

About Impetus

Impetus Technologies is a leading provider of Big Data solutions for the Fortune 500®. We help customers effectively manage the "3 Vs" of Big Data and create new business insights across their enterprises.

Website: www.bigdata.impetus.com | Email: bigdata@impetus.com

© 2013 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.

May 2013
