IEG 201402 INTUIT Building Big Data Analytics Platform


Information Excellence Group 2014 Spring "Business Analytics Industry Summit", Building Big Data Analytics Platform, Neeta Pande, Data Architect, INTUIT

Published in: Business

  1. INTUIT: Neeta Pande, Building Big Data Analytics Platform at Intuit
  2. Building Big Data Analytics Platform at Intuit, Neeta Pande, 8/Feb/2014
  3. Roadmap
     • Setting context and introduction to the Analytics Platform at Intuit
     • Key highlights that differentiate the platform
     • Sharing experiences building the platform
     • Wish-list of capabilities for the future of Big Data technologies
  4. Setting Context and Intro to the Analytics Platform
  5. Quick look into Intuit Offerings
  6. Introduction to the Analytical Platform
     • Central repository of analytical data from
       – Intuit products
       – Intuit Business Systems
       – Intuit Master Systems
       – External data sources
     • Caters to
       – Product Managers
       – Product Developers
       – Data Analysts
       – Data Scientists
       – Experience Designers
     • Enterprise-wide platform for cross-Intuit data analytics
  7. Technologies used to build the platform (slide graphic; only the HCatalog label is captured)
  8. Key highlights that differentiate the platform
  9. Capability View of the Platform
     • Consumers: Management, PM, PD, Data Analyst, Data Scientist
     • Policy-based access control
     • Central Analytics Platform
     • Data integration: batch, near real time, real time
     • Data sources: product user-entered data, product usage data, business data, master data, external data
  10. Key differentiators of the Platform
      • DWH semantic layers on Hadoop
      • Cohost sensitive information on the same infrastructure
      • Batch, near real time, and real time on the same infrastructure
      • Mobile, web, and desktop offerings
      • Enterprise-wide data across all offerings and cross-offerings
  11. Data Pipeline and Challenges
      • Pipeline stages: 1 Data Acquisition, 2 Data Securitization, 3 Data Cleansing, 4 Data Standardization, then Entity Mastering, Incremental load, DWH load, and Data Consumption (numbered 5 to 8 on the slide)
      • Encryption of sensitive information
      • Tokenization for join optimization on sensitive fields
      • Extract analytical information before encryption
      • Challenge loading data from transactional sources
      • Cleansing and standardization need third-party libraries
      • These steps are part of the same flow and need a Hadoop integration
      • DWH patterns like SCD, surrogate keys, and fact updates are challenging
      • MDM solutions from major vendors do not provide mastering in Hadoop
      • Interactive exploration happens in an MPP RDBMS because of advanced SQL and query performance
      • Sampling and extraction for building models in R
  12. Sharing Experiences building the platform
  13. Custom implementation of a mastering solution in Hadoop.
      • Custom implementation of symmetric-key encryption/decryption.
      • Hadoop does not provide an out-of-the-box solution.
      • Leading MDM solutions do not have Hadoop integration.
      • Evaluated third-party solutions; not mature enough.
      • Some open source tools have MDM capabilities, but they are not mature or widely adopted.
      • Key management using an HSM (SafeNet).
      • Decryption UDFs in MR, Pig, and Hive shield developers and users from the security implementation.
      • Evaluated Informatica Data Quality and found it a good fit for data cleansing and standardization, integrated in the same flow as batch data integration.
      • Batch data integration: evaluated the Big Data Integration capabilities of Informatica and found them relevant for the platform.
      • Real time: using Flume for real-time use cases. Found Kafka and Storm to be a good fit for several requirements.
      • Traditional DWH and incremental loads are challenging on Hadoop.
      • Upserts and SCD are handled best in HBase and exposed via HCatalog for querying.
      • Ad hoc query capabilities are still not mature or widely adopted, so an MPP RDBMS is still preferred.
      • Large-scale machine learning infrastructure is still being adopted, so widely used technology options are not yet in place.
  14. Wish-list for future of Hadoop
  15. Wish-list items
      • Data security support built into the platform
      • MDM solutions integrated and optimized for the platform
      • Interactive querying capabilities on the big data platforms (Impala, Tez)
      • Better support for traditional DWH capabilities
      • Integrated real-time, near-real-time, and batch processing pipelines
      • Distributed machine learning technologies with comprehensive and advanced capabilities
      • Open source end-to-end data quality solutions integrated with the platform
  16. Q&A Thank you
  17. About Information Excellence Group
      • Community Focused, Volunteer Driven
      • Knowledge Share, Accelerated Learning, Collective Excellence, Distilled Knowledge
      • Shared, Non Conflicting Goals
      • Validation / Brainstorm platform
      • Progress Information Excellence Towards an Enriched Profession, Business and Society
      • Mentor, Guide, Coach
      • Satisfied, Empowered Professional
      • Richer Industry and Academia
  18. About Information Excellence Group
      • Reach us at: blog, presentations, LinkedIn, Facebook, Google+, Twitter, email, #infoexcel
      • Have you enriched yourself by contributing to the community? Knowledge Share..
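
Slide 11 lists "tokenization for join optimization on sensitive fields". The slides do not say how tokens are produced, so the following is only a minimal sketch of one common approach: a deterministic keyed hash (HMAC-SHA256), so that equal sensitive values map to equal tokens and datasets can be joined without decrypting anything. The function names, the sample records, and the hard-coded demo key are all assumptions for illustration; the deck describes key management through an HSM instead.

```python
import hashlib
import hmac

# Hypothetical demo key. The slides describe HSM-based key management;
# a real deployment would never hard-code key material like this.
DEMO_KEY = b"demo-only-key"


def tokenize(value: str, key: bytes = DEMO_KEY) -> str:
    """Return a deterministic keyed token for a sensitive value."""
    # Normalizing first makes trivially different spellings join together.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def join_on(left, right, field):
    """Inner-join two lists of dicts on `field` using a hash index."""
    index = {}
    for row in right:
        index.setdefault(row[field], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[field], []):
            joined.append({**row, **match})
    return joined


# Both datasets store only the token, never the raw email address.
customers = [{"email_token": tokenize("a@example.com"), "plan": "pro"}]
events = [
    {"email_token": tokenize(" A@Example.com "), "action": "login"},
    {"email_token": tokenize("b@example.com"), "action": "export"},
]

joined = join_on(customers, events, "email_token")
```

Because the token is deterministic under a fixed key, the join runs on tokens alone; the trade-off is that equal inputs are visibly equal, which is why the key has to stay protected.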
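
Slides 11 and 13 call out DWH patterns such as SCD, surrogate keys, and upserts as challenging on Hadoop, and say they were handled best in HBase. The deck gives no implementation details, so this is only an in-memory sketch of the pattern itself (a Type 2 slowly changing dimension with surrogate keys), not of the HBase or HCatalog setup. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Dict, List, Optional


@dataclass
class DimRow:
    surrogate_key: int
    natural_key: str
    attributes: Dict[str, str]
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class Scd2Dimension:
    """Hypothetical in-memory Type 2 dimension with upsert semantics."""

    def __init__(self) -> None:
        self.rows: List[DimRow] = []
        self._current: Dict[str, DimRow] = {}
        self._next_key = count(1)

    def upsert(self, natural_key: str, attributes: Dict[str, str], as_of: date) -> DimRow:
        current = self._current.get(natural_key)
        if current is not None:
            if current.attributes == attributes:
                return current  # nothing changed, keep the current version
            current.valid_to = as_of  # close the old version
        # Insert a new version with a fresh surrogate key.
        row = DimRow(next(self._next_key), natural_key, dict(attributes), as_of)
        self.rows.append(row)
        self._current[natural_key] = row
        return row


dim = Scd2Dimension()
dim.upsert("cust-1", {"city": "Bangalore"}, date(2014, 1, 1))
dim.upsert("cust-1", {"city": "Bangalore"}, date(2014, 1, 15))  # no-op
dim.upsert("cust-1", {"city": "Pune"}, date(2014, 2, 1))
```

Each change closes the previous version and appends a new one, so history stays queryable. On an append-oriented store this update-in-place step is the hard part, which is consistent with the slides' preference for a store that supports upserts.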