
Sabishaw bhaskaran siemens_big_datawarehousing_ieg_2012feb



BigDataWarehousing by Sabishaw Bhaskaran, Siemens in Information Excellence Session 2013 Feb

Published in: Technology


  1. Information Excellence 2013 Feb Knowledge Share Session. The best way to put distance between you and the crowd is to do an outstanding job with information. How you gather, manage, and use information will determine whether you win or lose.
  2. Building a BigData warehousing and analysis system around Apache Hadoop. Sabishaw Bhaskaran
  3. OLTP: Used in operational systems. Registers transactions arising out of business workflows. The focus is accurate and consistent recording of transactions. Operational systems (OS) are purpose-built, e.g. SCM, HR, Financial, Manufacturing. The data is relational and heavily normalized, and ACID properties are essential. OLTP is the backbone of nearly all business IT systems as we know them today. The emphasis is on current data. OLTP is about running the business.
  4. But past transactional data is a trail of how well the business did. It can be curated into business insights, a kind of “OpIntel”. It is useful for analyzing events, identifying trends and making predictions, and it is valuable for operational, tactical and strategic decision making.
  5. OLAP: Used in decision support systems. Its purpose is to display, analyze and discover information. The focus is aggregation and fast query response. The data handled is typically historical (fewer updates), less detailed (aggregated) and holistic (integrated). OLAP can analyze multidimensional data interactively from multiple perspectives and can handle ad-hoc queries. Basic organization: star/snowflake schemas (or sometimes also 3NF). OLAP is about changing the business.
  6. ETL: Operational data exists in (departmental) silos, and operational systems are purpose-built, but we need an enterprise-wide (holistic) view of the business. Extract – pick what is relevant. Transform – ensure syntactic/semantic sanity (plus cleansing). Integrate (Load) – apply the global schema (EDM, the Enterprise Data Model).
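The three ETL steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two silo layouts (`SCM_ROWS`, `FINANCE_ROWS`), the field names, the day/month/year date convention on the finance side, and the tiny "global schema" keyed by product and day. It is not taken from the talk.

```python
from datetime import date

# Hypothetical raw rows from two departmental silos with different conventions.
SCM_ROWS = [{"sku": " a-100 ", "qty": "3", "day": "2013-02-01", "dock": "D4"}]
FINANCE_ROWS = [{"item": "A-100", "amount": "29.97", "date": "01/02/2013"}]


def extract(rows, wanted):
    """Extract: keep only the fields that are relevant to the warehouse."""
    return [{k: r[k] for k in wanted} for r in rows]


def transform_scm(row):
    """Transform: trim and upper-case the key, convert strings to typed values."""
    return {
        "product": row["sku"].strip().upper(),
        "quantity": int(row["qty"]),
        "day": date.fromisoformat(row["day"]),
    }


def transform_finance(row):
    """Transform: same target fields, assuming a day/month/year source format."""
    d, m, y = row["date"].split("/")
    return {
        "product": row["item"].strip().upper(),
        "revenue": float(row["amount"]),
        "day": date(int(y), int(m), int(d)),
    }


def load(scm, finance):
    """Integrate: merge both sources under one schema keyed by (product, day)."""
    warehouse = {}
    for rec in scm + finance:
        key = (rec["product"], rec["day"])
        fact = warehouse.setdefault(key, {"quantity": 0, "revenue": 0.0})
        for field in ("quantity", "revenue"):
            if field in rec:
                fact[field] += rec[field]
    return warehouse
```

Running `extract`, then the matching transform, then `load` yields one integrated fact per product and day, which is the holistic view that neither silo provides on its own.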
  7. Data Warehouse (diagram): operational systems (SCM, Finance) and external data; ETL logic; data warehouse.
  8. A formal definition: A data warehouse is a subject-oriented, integrated, time-variant, non-updatable collection of data, used in support of management decision-making processes (by Bill Inmon). Subject-oriented - The data in the data warehouse is organized so that all the data elements relating to the ‘same real-world event or object’ (e.g. sales) are linked together. Non-volatile - Data in the data warehouse are never over-written or deleted; once committed, the data are static, read-only, and retained for future reporting. Integrated - The data warehouse contains data from most or all of an organization's operational systems, and these data are made consistent. Time-variant - Values over time are available, and hence trends can be observed.
  9. A word on BigData: Ubiquitous digitization (IT, automation, RFID) and social media. Mind-boggling volumes. Rapid rate of generation. Structured, semi-structured and unstructured data. The 3Vs (Volume, Velocity and Variety). Challenges the capabilities of conventional (RDBMS-based) systems for storage and processing.
  10. MapReduce & Hadoop (figure). Source: Hadoop: The Definitive Guide, Tom White
  11. New view on DWH: Old school: ETL prior to warehousing (a.k.a. schema on write): data sources -> ETL -> data warehouse. New school: store first, ETL and analyze when necessary (a.k.a. schema on read): data sources -> Hadoop -> ETL & analyze.
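To make "schema on read" concrete, here is a small Python sketch. It assumes, purely for illustration, that raw records are stored untouched as JSON lines and that each analysis supplies its own schema (a mapping from field name to converter) at read time. The sample records and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw data is stored first, exactly as it arrived (hypothetical sample lines).
RAW_LINES = [
    '{"user": "alice", "ts": "2013-02-01T10:00:00", "text": "hello #hadoop"}',
    '{"user": "bob", "text": "no timestamp here"}',
    "not json at all",
]


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) only while reading.

    Lines that cannot be parsed are skipped by this reader but remain in
    the raw store; fields missing from a record are returned as None.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }
```

Two analyses can read the same raw lines with different schemas, for example `{"user": str}` for a user count and `{"user": str, "ts": str}` for a time-based report. With schema on write, by contrast, the second and third lines would have had to be fixed or rejected before loading.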
  12. Apache Hive: Projects a relation-oriented structure on the semi-structured data stored in the Hadoop Distributed File System (HDFS). Provides an interface to query the data (in HQL, similar to SQL) and translates the query to a plan which consists of a directed acyclic graph of map-reduce jobs to be executed by the Hadoop system in a distributed fashion across the cluster. Is an open source data warehousing solution built on Hadoop to give analysts the power of using an SQL-like language as well as MapReduce programs. Since HQL is very closely related to SQL, a mapping from HQL to SQL is possible.
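As a rough illustration of how an SQL-like aggregation can be expressed as a map-reduce job, the following Python sketch runs the equivalent of `SELECT user, COUNT(*) ... GROUP BY user` in three phases on a single machine. It is a conceptual model only, not Hive's actual query planner; the function names and the `user` column are assumptions.

```python
from collections import defaultdict


def map_phase(rows, key_column):
    """Map: emit a (group key, 1) pair for every input row."""
    for row in rows:
        yield row[key_column], 1


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values of each key to get the per-group count."""
    return {key: sum(values) for key, values in groups.items()}


def count_group_by(rows, key_column):
    """Single-machine stand-in for one map-reduce job in a query plan."""
    return reduce_phase(shuffle(map_phase(rows, key_column)))
```

On a cluster the map and reduce phases run in parallel on many nodes and the shuffle moves data between them; a more complex query becomes a graph of several such jobs.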
  13. Our DW system (diagram): Hive, Hadoop.
  14. Apache Sqoop: Using Hive for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Apache Sqoop is a tool designed for efficiently transferring large volumes of data between HDFS and structured data stores such as relational databases (e.g. MS-SQL, MySQL, Oracle). Sqoop graduated from the incubator in March 2012 and is now a top-level Apache project.
  15. Our DW system (diagram): streaming sources, relational sources, Sqoop, Hadoop, Hive.
  16. Microsoft Power Pivot: Microsoft Power Pivot is an add-in to Excel and an in-memory processing engine that provides multi-dimensional visualization over the data present in Excel. It provides functions such as slicing, dicing and pivoting, giving the user interactive visualizations over the data loaded into Power Pivot. Another advantage of Power Pivot is its ability to publish the analysis results online using Microsoft SharePoint, along with an interactive interface for users (thus bringing in the self-service feature!).
  17. Our DW system - Final picture (diagram): streaming sources, relational sources, Sqoop, Hadoop, Hive, Hive ODBC driver, Microsoft Power Pivot, Microsoft SharePoint server (publish results online).
  18. Application - Twitter data analysis. Size of Twitter data used:
      Data | Rows | Size (GB)
      Follower content (userid of each user with the userid of each of his followers); format: *.txt; total number of users: 40,103,281 | 1,468,365,132 | 26
      Tweets data (containing the time, user and content of the tweet); format: *.txt; total period of tweets: 1 month | 29,986,960 | 11
  19. Twitter – Most aggressive users
  20. Twitter – Most popular topics: A user-defined function, written in Java, picks the word following each hash-tag in each tweet (to get the trending topics). This user-defined function is used in the query to derive the total count of each word.
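The logic of such a hashtag function can be sketched in a few lines. The talk's function was a Java UDF used inside a Hive query; the Python version below only illustrates the same idea outside Hive. The regular expression, the lower-casing of topics and the function names are assumptions, not the author's implementation.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(tweet):
    """Return the word following each hash-tag in a tweet, lower-cased."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


def topic_counts(tweets):
    """Count how often each hashtag word occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(hashtags(tweet))
    return counts
```

Calling `topic_counts(tweets).most_common(10)` then gives the ten most frequent topics, which corresponds to the grouping and ordering that the query performs over the function's output.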
  21. Twitter – Most popular users: Unexpectedly, none of the top 10 users in this list appears in the top-10 list by tweet count. So users with a very high number of followers do not necessarily tweet proportionally often. It was also astonishing that the first user had 3 million followers.
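The comparison behind this observation can be sketched as follows. The sketch assumes the follower data is available as (user, follower) pairs, as described for the follower file, and that the tweets have been reduced to a list of author ids; the function names are hypothetical.

```python
from collections import Counter


def top_n(counter, n):
    """Return the set of the n keys with the highest counts."""
    return {key for key, _count in counter.most_common(n)}


def popular_and_active(follower_edges, tweet_authors, n=10):
    """Users who are in both the top-n by followers and the top-n by tweets."""
    followers = Counter(user for user, _follower in follower_edges)
    tweets = Counter(tweet_authors)
    return top_n(followers, n) & top_n(tweets, n)
```

An empty result for `n=10` corresponds to the finding on this slide: the most-followed users and the most frequent tweeters are disjoint groups.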
  22. Twitter - Spammers: This shows that wpstudios and dominiquerdr have more than 99% retweets in their total tweet count, so we can classify these users as spammers, since they post very few original tweets.
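A minimal sketch of this retweet-share classification is shown below. The 99% threshold comes from the slide; how a retweet is recognized is not stated in the talk, so the `"RT @"` prefix check is an assumption, as are the function names and the (user, text) input format.

```python
from collections import defaultdict


def is_retweet(text):
    """Assumption: a retweet is a tweet whose text starts with 'RT @'."""
    return text.startswith("RT @")


def retweet_ratios(tweets):
    """Map each user to the share of retweets in that user's tweets."""
    totals = defaultdict(int)
    retweets = defaultdict(int)
    for user, text in tweets:
        totals[user] += 1
        if is_retweet(text):
            retweets[user] += 1
    return {user: retweets[user] / totals[user] for user in totals}


def likely_spammers(tweets, threshold=0.99):
    """Users whose retweet share is strictly above the threshold, sorted."""
    ratios = retweet_ratios(tweets)
    return sorted(user for user, ratio in ratios.items() if ratio > threshold)
```

In practice a minimum tweet count would probably be needed as well, since a user with a single retweet also has a 100% retweet share.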
  23. Twitter – Activity spread during the day (Re-tweets): This analysis uses the slicer function in Power Pivot.
  24. Thank you for your attention
  25. About Information Excellence Group. Reach us at: blog: linked in: Excellence-3893869 facebook: excellence-group/171892096247159 presentations: twitter: #infoexcel email: Have you enriched yourself by contributing to the community Knowledge Share?