Deriving Intelligence from Large Data - Hadoop implementation and Applying Analytics
Hadoop consultancy and implementation


  • Examples of BI and analytics over structured/semi-structured data: financial data monitoring, social network data, clickstream data, and server log data for fault detection. Typical analyses over that data: sentiment analysis and predictive analysis.
    Script: In today's world we see a data deluge everywhere. Being surrounded by data naturally raises expectations about the returns that can be generated from it, and the ideas for using it are practically unlimited. Let us look at some real-world examples.
    Financial data monitoring: Billions of financial transactions are executed every hour. For an organization that provides fraud detection over financial data, spotting suspicious-looking transactions in such a huge, constantly growing volume is nearly impossible by conventional means. Nor can the analysis be restricted to a small slice of data captured over a week or a month; it may need to span months or even years and include data arriving from many distributed geographical locations. On top of that, every byte recorded in such a system may need to be monitored, analyzed and processed. The Hadoop ecosystem provides an excellent framework for doing that processing in a distributed manner on commodity hardware, very cost-effectively. Even once the data has been processed and summarized to the smallest useful level, the summarized information is itself large and has to be persisted and processed over and over again, for example to generate alerts on eyebrow-raising transactions. That is where BI and analytics come in.
    Sentiment analysis: Many organizations today drive their businesses by the feelings of their customers and potential customers. Every organization wants to know the emotions associated with its ideas, products and services, and there is no dearth of this information, mostly in the form of unstructured data such as tweets, comments, postings and blogs. NLP-based applications can gauge customer sentiment and identify compliments and grievances, which organizations can use to decide the course of their business roadmaps. The same data also supports predictive analysis by revealing patterns in customer behavior and preferences.
    Every business domain can list numerous ways of using the data it already has, or wants to gather, for these benefits, and this discussion will keep reiterating that data can be used to derive more business opportunities.
  • Let us look at some of the criteria for choosing the right BI strategy for your organization.
    Overall ease and cost of implementation: Gathering the right information from data can be very expensive, so a BI solution chosen without prudent thought will strain IT budgets. Another consideration is how easily the solution can be implemented and start delivering returns. Choosing an inexpensive but hard-to-use BI solution can be disastrous, because indirect costs such as training and adoption can raise the overall cost of implementation drastically and even doom the whole effort. Choosing the right BI strategy is therefore critical from a cost perspective.
    Real-time analysis vs. batch analysis: You need to identify whether the data being gathered and stored must be analyzed in real time using high-touch queries, or whether it can be processed in batches, using a framework such as Hadoop, to produce results offline. Large data is generally captured to be processed later with efficient parallel processing techniques like Hadoop; once the data is structured, it can be used for BI and analytics. Later in the presentation we will discuss the approaches best suited to real-time and batch analytics over the summarized data.
    End-user ad hoc analytics: Today everyone is smart enough to define their own requirements, and service providers who only react to those needs end up playing catch-up rather than innovating. Providing ad hoc analytics to your end users gives you a competitive advantage: your customers can get what they need from the data themselves instead of coming back to you for information that may not be broadly useful but is of utmost importance to one of them. It also exposes you to a wealth of ideas about the gathered information that would be difficult to arrive at individually.
    Structured/semi-structured/unstructured data: You have to gauge the best strategy for storing large data based on its size and usage; it can be kept in unstructured, semi-structured or completely structured form. Sometimes organizations do not even know, while capturing and storing the data, what structure analytics will eventually need; in those cases a hybrid approach may give you the best of all options. Data in its various forms may also need to be combined at some level: historical data, which is often kept in semi-structured or unstructured form such as web logs, may be joined with structured data stored in relational form to yield more business insight.
    Complex computations: With large data, BI generally involves complex computations at two levels: first when the large data is converted into structured, summarized form, and second when that summary data is turned into information by the BI solution (a sketch of such a first-level summarization job appears below). This, too, plays a role when you formalize your BI strategy.
    Other challenges to consider include data security, scalability, accessibility and fault tolerance. Based on your business needs, you have to weigh each of these challenges and identify the right BI strategy for yourself. In the next few minutes we will talk about the popular approaches generally used for implementing BI over large data. Before moving on, let me list them: analytics over Hadoop with an MPP DW, indirect analytics over Hadoop, and direct analytics over Hadoop.
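All three approaches begin with the same first-level step: a Hadoop job that turns raw records into structured, summarized output. Purely as an illustration, assuming comma-separated transaction logs with the account id in the first field and the amount in the fourth, and Hadoop's newer org.apache.hadoop.mapreduce API, such a summarization job might be sketched like this (all field positions and paths are hypothetical):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TxnSummaryJob {

    // Map: parse one raw transaction line and emit (accountId, amount).
    public static class TxnMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 3) {                      // skip lines without enough fields
                ctx.write(new Text(fields[0]),            // account id (illustrative position)
                          new DoubleWritable(Double.parseDouble(fields[3])));  // amount
            }
        }
    }

    // Reduce: sum all amounts seen for one account, producing one summary row per account.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            ctx.write(key, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "transaction summary");
        job.setJarByClass(TxnSummaryJob.class);
        job.setMapperClass(TxnMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw transaction logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // summarized output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The summarized output of a job like this can then stay on HDFS for direct analytics, or be loaded onward into an MPP DW or an RDBMS, as the three approaches below describe.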
  • Analytics over Hadoop with an MPP DW: Many options are available in the market today for integrating MPP data warehouses with Hadoop. This option is worth considering if your data remains large even after summarization. Hadoop is used to clean and transform the data into a structured form, which is then loaded into the MPP DW of your choice. Once the data is loaded, you can write UDFs to perform database-level analytics and integrate the warehouse with BI solutions over ODBC/JDBC connectivity for end-user analytics and reporting. MPP DWs also let you use performance enhancement techniques such as index compression, materialized views, result-set caching and I/O sharing. Some of them additionally provide a framework for executing MR jobs on their own clusters, giving you two levels of parallel processing. These features work very well for high-touch queries and provide an excellent framework for end-user ad hoc analytics. The downside is that this approach can be expensive to get started with: most MPP DWs are costly to acquire, and some also require high-end servers for deployment. It also needs teams with solid hands-on experience in MPP DW management and development, which is itself a challenge in today's fast-changing environment, where open-source technologies like Hadoop are advancing rapidly in adoption and flexibility.
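Once the summarized data is in the warehouse, BI tools reach it over standard ODBC/JDBC. As a minimal sketch (the JDBC URL, credentials, and the txn_summary table and its columns are all hypothetical; in practice you would use whichever JDBC driver your MPP DW vendor supplies), a report query could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class WarehouseReportQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your MPP DW's driver and URL.
        String url = "jdbc:postgresql://dw-host:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "report_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT region, SUM(txn_amount) AS total "
                   + "FROM txn_summary WHERE txn_date >= ? GROUP BY region")) {
            stmt.setDate(1, java.sql.Date.valueOf("2011-01-01"));  // report window (illustrative)
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t" + rs.getDouble("total"));
                }
            }
        }
    }
}

A BI server would issue the same kind of SQL, typically through a connection pool, rather than from a standalone main method.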
  • Indirect analytics over Hadoop: Another interesting approach is to use Hadoop to clean and transform the data into a structured form and then load it into an RDBMS. Hadoop comes with good support for moving data between RDBMS data sources and Hadoop systems through its DBInputFormat and DBOutputFormat interfaces. Once the unstructured data has been processed, it can be pushed to an RDBMS database, which then acts as the data source for any BI solution. This gives the end user the parallel processing power of Hadoop together with the flexibility of a SQL interface to the summarized data. It works well when the summarized data is not too big to pose a challenge for the RDBMS being used, and it is considerably less expensive than the MPP DW approach we just discussed. It is equally good for high-touch queries, where the user wants real-time ad hoc analytics, because most RDBMS databases today come with a comprehensive set of performance enhancement techniques. However, when the summarized data is itself huge, this approach may fail to deliver, and if batch analysis is the key requirement, moving the data into an RDBMS could be a redundant activity. What if you could run BI and analytics directly over the data on a Hadoop system, without moving it into any RDBMS database? Our next option talks about precisely that.
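The DBOutputFormat side of this lives in Hadoop's org.apache.hadoop.mapreduce.lib.db package. Below is a minimal sketch of the wiring, assuming a hypothetical reporting database and a txn_summary(category, total) target table; the driver class, URL, credentials and the surrounding map/reduce logic are all illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// A record emitted by the reducer; DBOutputFormat binds its fields to an INSERT statement.
public class SummaryRecord implements Writable, DBWritable {
    private String category;
    private double total;

    public SummaryRecord() {}
    public SummaryRecord(String category, double total) {
        this.category = category;
        this.total = total;
    }

    @Override public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, category);   // bound to the first column passed to setOutput()
        stmt.setDouble(2, total);      // bound to the second column
    }
    @Override public void readFields(ResultSet rs) throws SQLException {
        category = rs.getString(1);
        total = rs.getDouble(2);
    }
    // Plain Writable serialization for moving records between map and reduce.
    @Override public void write(DataOutput out) throws IOException {
        out.writeUTF(category);
        out.writeDouble(total);
    }
    @Override public void readFields(DataInput in) throws IOException {
        category = in.readUTF();
        total = in.readDouble();
    }

    // Job wiring: the reducer emits SummaryRecord keys and NullWritable values.
    public static Job configureJob(Configuration conf) throws IOException {
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://db-host/reporting", "etl_user", "secret");
        Job job = Job.getInstance(conf, "push summary to RDBMS");
        job.setOutputFormatClass(DBOutputFormat.class);
        job.setOutputKeyClass(SummaryRecord.class);
        job.setOutputValueClass(NullWritable.class);
        // Target table and the columns each record is written into.
        DBOutputFormat.setOutput(job, "txn_summary", "category", "total");
        return job;
    }
}

In practice a bulk-transfer tool such as Apache Sqoop is often used for this Hadoop-to-RDBMS step instead of hand-written DBOutputFormat jobs.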
  • Direct analytics over Hadoop: Now let us take a brief example of how useful it can be to report directly off the Hadoop file system. Say the processed and summarized data is itself quite large and sits on the Hadoop system, and we want to use it for batch reporting without the complication of moving it out to an MPP DW or an RDBMS. This becomes possible by using Hive as the interface to the data on Hadoop. As Saurabh discussed earlier, Hive provides a very promising interface that executes SQL queries by converting them into MR jobs, which run on the Hadoop cluster against the data already present there. This approach allows batch and asynchronous analytics over the same data sitting on the Hadoop system. It is also very cost-effective, since it avoids the cost of managing a separate data source beyond your existing Hadoop system, and it scales to any level along with your summarized data. At Intellicus, we believe that all three approaches we have just discussed have their own advantages and disadvantages; it would not be prudent to advocate one over another, as users are wise enough to choose their own BI strategy based on their business needs.
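To a reporting tool, this "direct" option can still look like SQL, because Hive also exposes a JDBC driver. A minimal sketch, assuming a reachable HiveServer2 endpoint (older Hive releases used the org.apache.hadoop.hive.jdbc.HiveDriver instead) and a hypothetical web_log_summary table already defined over the summarized files in HDFS:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveBatchReport {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, database and table name are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Hive compiles this SQL into MapReduce jobs that run on the Hadoop cluster,
            // directly over the summarized files already sitting in HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, COUNT(*) AS events FROM web_log_summary GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

Because each such query runs as a batch of MR jobs, this path suits batch and asynchronous reporting rather than low-latency, high-touch queries.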
  • Now let us look at the factors to consider while deciding the BI strategy for extracting business insight from large data.
    Structure of the data: One of the most crucial technical factors is whether the data is structured. In other words, is it possible to derive structure from the data, and can that be done at the time of data generation and loading? Extracting the structure may require complex transformations that are not feasible at generation and loading time, and sometimes organizations do not even know, while capturing and storing the data, what structure analytics will eventually need.
    Real-time analysis or batch analysis: Consider whether batch analysis will suffice or real-time analysis is needed.
    End-user ad hoc analytics: One of the most important factors is whether the solution lets end users perform ad hoc analytics over the large data. Can the data be presented as business entities, so that CXOs can query what they want without relying on the IT department to design reports for them?

Deriving Intelligence from Large Data - Hadoop implementation and Applying Analytics: Presentation Transcript

  • Deriving Intelligence from Large Data Using Hadoop and Applying Analytics
    By
    Impetus Technologies
    Recorded version available at:
    http://www.impetus.com/webinar_registration?event=archived&eid=27
  • Agenda
    Large Scale Data
    Storing, Transforming and Analyzing
    Using Hadoop for ETL
    Case Study
    Large Scale Analytics - Why
    Challenges and Approaches
    The Road Ahead
    The Large Scale BI Strategy
  • Large Scale Data
  • Storing Large Data
  • Transforming Large Data
  • Analytics on Large Scale Data
  • Our Solution
  • HIVE
  • How HIVE Helps
  • Case Study
  • How It Works
  • Large Scale Analytics: Why?
  • Analytics: Challenges
  • Analytics over Hadoop with MPP DW
  • Indirect Analytics over Hadoop
  • Direct Analytics over Hadoop
  • The Road Ahead
  • The Large Scale BI Strategy
  • Impetus Technologies
  • Questions
  • Thank you
    For more information, write to us at inquiry@impetus.com or visit www.impetus.com
    Knowledge Partner
    Email- info@intellicus.com
    Visit - www.intellicus.com