Sabishaw bhaskaran siemens_big_datawarehousing_ieg_2012feb
 


BigDataWarehousing by Sabishaw Bhaskaran, Siemens in Information Excellence Session 2013 Feb



Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Information Excellence, 2013 Feb Knowledge Share Session. "The best way to put distance between you and the crowd is to do an outstanding job with information. How you gather, manage, and use information will determine whether you win or lose." Information Excellence, informationexcellence.wordpress.com
    • Building a BigData warehousing and analysis system around Apache Hadoop. Sabishaw Bhaskaran
    • OLTP: Used in operational systems. Registers transactions arising out of business workflows. Focus is accurate & consistent recording of transactions. Operational systems are purpose-built, e.g. SCM, HR, financial, manufacturing. Relational and heavily normalized; ACID properties are essential. Backbone of nearly all business IT systems as we know them today. Emphasis is on current data. Is about running the business.
    • But past transactional data is a trail of how well the business did. It can be curated into business insights, a kind of "OpIntel". It is useful for analyzing events, identifying trends & making predictions, and is valuable for operational, tactical & strategic decision making.
    • OLAP: Used in decision support systems. Purpose is to display, analyze & discover information. Focus is aggregation and fast query response. Data handled is typically historical (fewer updates), less detailed (aggregated) & holistic (integrated). Capable of analyzing multidimensional data interactively from multiple perspectives, and can handle ad-hoc queries. Basic organization: star/snowflake schemas (or sometimes also 3NF). Is about changing the business.
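
As a rough illustration of the star-schema organization and the "multiple perspectives" idea, the sketch below builds a tiny in-memory star schema with Python's `sqlite3` module and aggregates the same fact table along different dimensions. The table names, column names, and sample rows are hypothetical and are not taken from the presentation.

```python
import sqlite3

# Hypothetical star schema: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO dim_date VALUES (10, 2012, 1), (11, 2012, 2);
INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 11, 150.0),
                              (2, 10, 80.0), (2, 11, 20.0);
""")

# Only these dimension columns may be used for grouping.
ALLOWED_DIMENSIONS = {"category", "year", "quarter"}


def sales_by(dimension):
    """Aggregate the fact table along one dimension column."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    query = f"""
        SELECT {dimension}, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY {dimension}
        ORDER BY {dimension}
    """
    return conn.execute(query).fetchall()


by_category = sales_by("category")
by_quarter = sales_by("quarter")
```

The same facts are summarized once per product category and once per quarter without changing the stored data, which is the kind of ad-hoc, multi-perspective query an OLAP system is built to answer quickly.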
    • ETL: Operational data exists in (departmental) silos. Extract: pick what's relevant. Operational systems are purpose-built. Transform: ensure syntactic/semantic sanity (plus cleansing). We need an enterprise (holistic) view of the business. Integrate (Load): apply the global schema (EDM, the Enterprise Data Model).
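
The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "silos", their field names, their date formats, and the three-column global schema are all hypothetical.

```python
from datetime import datetime

# Hypothetical extracts from two departmental silos; each silo uses its own
# field names and date format.
hr_rows = [{"emp": " alice ", "joined": "2012-02-01", "dept": "scm", "badge": "x1"}]
finance_rows = [{"employee_name": "BOB", "start": "01/03/2012", "cost_center": "finance"}]


def extract(rows, fields):
    """Extract: keep only the fields that are relevant to the warehouse."""
    return [{field: row[field] for field in fields} for row in rows]


def transform(row, name_key, date_key, date_format, dept_key):
    """Transform: cleanse values and map them onto the shared global schema."""
    return {
        "name": row[name_key].strip().title(),
        "start_date": datetime.strptime(row[date_key], date_format).date().isoformat(),
        "department": row[dept_key].strip().upper(),
    }


def load(warehouse, rows):
    """Load: append rows that already conform to the global schema."""
    warehouse.extend(rows)


warehouse = []
load(warehouse, [transform(r, "emp", "joined", "%Y-%m-%d", "dept")
                 for r in extract(hr_rows, ["emp", "joined", "dept"])])
load(warehouse, [transform(r, "employee_name", "start", "%d/%m/%Y", "cost_center")
                 for r in extract(finance_rows, ["employee_name", "start", "cost_center"])])
```

After both loads, every row in `warehouse` has the same keys and the same date format, regardless of which silo it came from.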
    • Data Warehouse. [Diagram: operational systems (SCM, Finance) and external data pass through ETL logic into the data warehouse.]
    • A formal definition: A data warehouse is a subject-oriented, integrated, time-variant, non-updatable collection of data, used in support of management decision-making processes (by Bill Inmon). Subject-oriented: the data in the data warehouse is organized so that all the data elements relating to the 'same real-world event or object' (e.g. sales) are linked together. Non-volatile (non-updatable): data in the data warehouse are never overwritten or deleted; once committed, the data are static, read-only, and retained for future reporting. Integrated: the data warehouse contains data from most or all of an organization's operational systems, and these data are made consistent. Time-variant: values over time are available, and hence trends can be observed.
    • A word on BigData: Ubiquitous digitization (IT, automation, RFID) and social media. Mind-boggling volumes. Rapid rate of generation. Structured, semi-structured & unstructured data. The 3Vs (Volume, Velocity & Variety). Challenges the capabilities of conventional (RDBMS-based) systems for storage & processing.
    • MapReduce & Hadoop. [Figure. Source: Hadoop: The Definitive Guide, Tom White]
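
Since the slide's figure did not survive in the transcript, here is a minimal single-process sketch of the MapReduce programming model using the usual word-count example. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


word_counts = run_job(["big data", "big warehouse"])
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the logic of each phase stays the same.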
    • New view on DWH. Old school: ETL prior to warehousing (a.k.a. schema on write): Data sources -> ETL -> Data warehouse. New school: store first, ETL & analyze when necessary (a.k.a. schema on read): Data sources -> Hadoop -> ETL & analyze.
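
The "schema on read" idea can be illustrated with a small Python sketch: raw lines are stored untouched, and a schema is applied only when the data is read. The in-memory list standing in for HDFS, the tab-delimited layout, and the column names are assumptions made for this example.

```python
# Stand-in for raw files kept as-is in a store such as HDFS.
raw_store = []


def store(line):
    """Store first: no parsing or validation happens at write time."""
    raw_store.append(line)


def read_with_schema(schema, delimiter="\t"):
    """Apply a schema at read time; rows that do not fit are skipped."""
    for line in raw_store:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != len(schema):
            continue  # malformed for this schema, ignore
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # a field could not be converted to the requested type


store("1\talice\t3")
store("not a structured line")
store("2\tbob\tmany")

# Hypothetical schema: (column name, type) pairs chosen by the reader.
user_schema = [("id", int), ("user", str), ("followers", int)]
rows = list(read_with_schema(user_schema))
```

A different analysis could read the same stored lines with a different schema, which is the flexibility the "new school" approach trades against the up-front guarantees of schema on write.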
    • Apache Hive: Projects a relation-oriented structure on the semi-structured data stored in the Hadoop Distributed File System (HDFS). Provides an interface to query the data (in HQL, which is similar to SQL) and translates each query into a plan consisting of a directed acyclic graph of map-reduce jobs, which the Hadoop system executes in a distributed fashion across the cluster. Is an open source data warehousing solution built on Hadoop that gives analysts the power of a SQL-like language as well as MapReduce programs. Since HQL is very closely related to SQL, a mapping from HQL to SQL is possible.
    • Our DW system. [Diagram components: Hive, Hadoop.]
    • Apache Sqoop: Using Hive for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Apache Sqoop is a tool designed for efficiently transferring large volumes of data between HDFS and structured data stores such as relational databases (e.g. MS-SQL, MySQL, Oracle). Sqoop successfully graduated from the incubator in March 2012 and is now a top-level Apache project.
    • Our DW system. [Diagram components: streaming sources, relational sources, Sqoop, Hadoop, Hive.]
    • Microsoft Power Pivot: Microsoft Power Pivot is an add-in to Excel and an in-memory processing engine that provides multi-dimensional visualization over the data present in Excel. It provides functions such as slicing, dicing and pivoting, giving the user interactive visualizations over the data loaded into Power Pivot. Another advantage of Power Pivot is its ability to publish the analysis results online using Microsoft SharePoint, along with an interactive interface for users (thus bringing in the self-service feature!).
    • Our DW system - Final picture. [Diagram components: streaming sources, relational sources, Sqoop, Hadoop, Hive, Hive ODBC driver, Microsoft Power Pivot, Microsoft SharePoint server (publish results online).]
    • Application - Twitter data analysis. Size of Twitter data used:
      Follower content (user ID of each user with the user ID of each of his followers): 1,468,365,132 rows, 26 GB. Format: *.txt. Total number of users: 40,103,281.
      Tweets data (containing the time, user and content of the tweet): 29,986,960 rows, 11 GB. Format: *.txt. Total period of tweets: 1 month.
    • Twitter – Most aggressive users
    • Twitter – Most popular topics. A user-defined function is written in Java to pick the word following each hash-tag in each tweet (to get the trending topics). This user-defined function is used in the query to derive the total count of each word.
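
The logic of that function can be sketched as follows. This is a Python analogue for illustration, not the Java UDF used in the presentation; the regular expression, the lower-casing of tags, and the sample tweets are assumptions.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_words(tweet):
    """Return the word following each hash-tag in one tweet, lower-cased."""
    return [word.lower() for word in HASHTAG.findall(tweet)]


def topic_counts(tweets):
    """Count how often each hashtag word occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(hashtag_words(tweet))
    return counts


sample_tweets = ["Loving #Hadoop and #hive", "#hadoop rocks", "no tags here"]
counts = topic_counts(sample_tweets)
top_topic = counts.most_common(1)
```

In the Hive setup described on the slide, the extraction step would live in the UDF and the counting would be expressed as a grouped aggregate in the query.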
    • Twitter – Most popular users. This was extremely unexpected: none of the top 10 users in this list appears in the corresponding list by tweet count. So it is not really the case that people with a very high number of followers tweet proportionally often. That the first user had 3 million followers was also an astonishing result.
    • Twitter - Spammers. This shows that wpstudios and dominiquerdr have more than 99% retweets in their total tweet count, so we can classify these users as spammers, since they have very few original tweets.
    • Twitter – Activity spread during the day (Re-tweets). This analysis uses the slicer function in Power Pivot.
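
The retweet-share criterion can be written down as a short Python sketch. The 99% threshold comes from the slide; how a retweet is recognized (text starting with "RT @"), the minimum tweet count, and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def retweet_stats(tweets):
    """Return {user: (retweet_count, total_count)} for (user, text) pairs."""
    totals = defaultdict(int)
    retweets = defaultdict(int)
    for user, text in tweets:
        totals[user] += 1
        if text.startswith("RT @"):  # assumption: retweets carry this prefix
            retweets[user] += 1
    return {user: (retweets[user], totals[user]) for user in totals}


def likely_spammers(tweets, threshold=0.99, min_tweets=100):
    """Users whose retweet share exceeds the threshold.

    min_tweets is a hypothetical guard so that a user with one or two
    retweets and nothing else is not flagged.
    """
    return sorted(
        user
        for user, (retweeted, total) in retweet_stats(tweets).items()
        if total >= min_tweets and retweeted / total > threshold
    )


sample = [("user_a", "RT @someone: hello")] * 200
sample += [("user_b", "my own tweet"), ("user_b", "RT @someone: hello")]
flagged = likely_spammers(sample)
```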
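
The aggregation behind such a chart amounts to counting retweets per hour of the day. The sketch below shows that step in Python; the timestamp format, the "RT @" retweet marker, and the sample rows are assumptions, and the Power Pivot slicer itself is not reproduced.

```python
from collections import Counter
from datetime import datetime


def retweets_per_hour(tweets, time_format="%Y-%m-%d %H:%M:%S"):
    """Return a 24-element list with the number of retweets in each hour.

    tweets is an iterable of (timestamp_string, text) pairs.
    """
    counts = Counter()
    for timestamp, text in tweets:
        if text.startswith("RT @"):  # assumption: retweets carry this prefix
            counts[datetime.strptime(timestamp, time_format).hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]


sample = [
    ("2012-01-05 09:15:00", "RT @a: news"),
    ("2012-01-05 09:45:10", "RT @b: more news"),
    ("2012-01-05 17:02:33", "RT @c: evening"),
    ("2012-01-05 17:30:00", "an original tweet"),
]
hourly = retweets_per_hour(sample)
```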
    • Thank you for your attention
    • About Information Excellence Group. Reach us at: blog: http://informationexcellence.wordpress.com/ linked in: http://www.linkedin.com/groups/Information-Excellence-3893869 facebook: http://www.facebook.com/pages/Information-excellence-group/171892096247159 presentations: http://www.slideshare.net/informationexcellence twitter: #infoexcel email: informationexcellence@compegence.com, informationexcellencegroup@gmail.com. Have you enriched yourself by contributing to the community Knowledge Share?