Hadoop India Summit, Feb 2011 - Informatica



Lightning talk by Sanjeev Kumar of Informatica India, presented at the Hadoop India Summit on Feb 16, 2011.

Published in: Technology
  • Map/Reduce implementation. Apache open-source project, Yahoo-dominated. Two major components: HDFS, a failure-resilient distributed file system, and Map/Reduce, a failure-resilient distributed computing framework. Scales to clusters of a thousand-plus nodes. Used by Yahoo, Facebook, etc.
  • This is the Informatica Corporate Presentation

    1. Informatica & Big Data. Sanjeev Kumar, VP & MD, Informatica India. Apache Hadoop India Summit 2011
    2. Agenda
        • Big Data
        • Big Data in Enterprise
        • Informatica & Data
        • Informatica & Big Data
    3. Why “Big Data” Now? Exploding Data Volumes
        • 2,500 exabytes of new information in 2012, with the Internet as primary driver
        • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
        • Chart labels: Relational; Complex, Unstructured
        • Source: IDC white paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands", May 2009
    4. Why Now? Exploding Data Volumes
        • Explosion in user-generated content
            • e.g. blogs, Twitter, Facebook etc.
        • Proliferation of web-connected devices
            • Smartphone interactions with the web
        • Increased consumption of digital content
            • Netflix, HULU, Pandora etc.
        • Internet of things
            • Smart-grid and smart-meters
            • Machine-generated data via the web
    5. Why Now? New Apps/Use-cases
        • Analyze customer/market sentiment
            • Text analytics on social media, blogs
        • Achieve operational efficiency
            • e.g. analyze CDRs to optimize cell tower placements
        • Make recommendations
            • Data mining on click-stream, purchase history
        • Predict the future
            • e.g. Flightcast predicts flight delays
    6. Big Data Challenges
        • Storage
            • Cost-effective scalability: to multi-terabytes and petabytes
            • Non-traditional data models: complex, semi-structured data
        • Processing
            • Data mining, collaborative filtering for structured data
            • Text analytics, classification etc. for unstructured data
        • Regulatory Compliance
            • Data privacy / masking
            • Data archival
    7. Addressing Big Data Challenges
        • Storage
            • Parallel Databases
                • Greenplum (EMC), Vertica, AsterData
            • Distributed Key/Value Stores
                • HBase, Google’s BigTable, Amazon’s SimpleDB
            • Distributed File Systems
                • HDFS, GFS, ParAccel
        • Analytics
            • SQL with extensions
            • Map/Reduce
            • DataFlow languages: Pig, Sawzall etc.
    8. Hadoop Technology Stack: HDFS, HBase, Map/Reduce, Pig, Hive, Cascading
    9. Hadoop Momentum (charts: Search Volume Index; News Reference Volume; Job Trends from Indeed.com)
    10. Big Data in the Enterprise – Hadoop Usage
    11. Big Data in the Enterprise – Case Studies: Hadoop World 2009
        • Yahoo!: Social Graph Analysis
        • VISA: Large Scale Transaction Analysis
        • China Mobile: Data Mining Platform for Telecom Industry
        • JP Morgan Chase: Data Processing for Financial Services
        • eHarmony: Matchmaking in the Hadoop Cloud
        • Rackspace: Cross Data Center Log Processing
        • Visible Technologies: Real-Time Business Intelligence
        • Booz Allen Hamilton: Protein Alignment using Hadoop
        • Slides and videos at http://www.cloudera.com/hadoop-world-nyc
    12. Big Data in the Enterprise – Case Studies: Hadoop World 2010
        • eBay: Hadoop at eBay
        • Twitter: The Hadoop Ecosystem at Twitter
        • General Electric: Sentiment Analysis powered by Hadoop
        • Yale University: MapReduce and Parallel Database Systems
        • AOL: AOL’s Data Layer
        • Facebook: HBase in Production
        • Bank of America: The Business of Big Data
        • StumbleUpon: Mixing Real-Time and Batch Processing
        • Raytheon: SHARD: Storing and Querying Large-Scale Data
        • More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/
    13. Agenda
        • Big Data
        • Big Data in Enterprise
        • Informatica & Data
        • Informatica & Big Data
    14. Informatica – Our Singular Mission: Enabling the Information Economy
        • We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives
    15. Informatica – What We Do: Comprehensive, Unified, Open and Economical platform
        • Data Warehouse
        • Data Migration
        • Test Data Management & Archiving
        • Master Data Management
        • Data Synchronization
        • B2B Data Exchange
        • Data Consolidation
        • Complex Event Processing
        • Ultra Messaging
        • Other diagram labels: Application; Partner Data; SWIFT, NACHA, HIPAA, …; Cloud Computing; Unstructured; Database
    16. Informatica & Data: Verbs on Data – We do things to data!
        INFA = Data + [
            Archival | As a Service | Cleansing | Clustering | Consolidation |
            Conversion | De-duping | Exchange | Extraction | Federation |
            Hub | Identity | Integration | Life-cycle Management |
            Loading | Masking | Mastering | Matching | Migration | On Demand |
            Privacy | Profiling | Provisioning | Quality | Quality Assessment |
            Registry | Replication | Retirement | Services | Stewardship |
            Sub-setting | Synchronization | Test Management | Transformation |
            Validation | Virtualization | Warehousing |
        ]
    17. Informatica & Big Data
        • HDFS as a source and a target: enable universal data connectivity for Hadoop developers
        • Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic
        • Lower the barrier to Hadoop entry by using Informatica Developer as a development tool
        • Support virtualized access to data split across HDFS and (relational) data warehouses
    18. Informatica & Hadoop – Big Picture
        • Hadoop Cluster: HDFS Data Node, HDFS Name Node, HDFS Job Tracker
        • Transformation Engine for custom data processing
        • Graphical IDE for Hadoop Development
        • Enterprise Connectivity for Hadoop programs
        • Metadata Repository
        • Other diagram labels: Weblogs, Enterprise Applications, Databases, Semi-structured, Un-structured, BI, DW/DM
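
Slides 7 and 8 name Map/Reduce as the processing layer of the Hadoop stack, and slide 5 mentions analyzing CDRs (call detail records) as a use case. The sketch below illustrates the map, shuffle, and reduce steps of that programming model in plain, single-process Python. It does not use the Hadoop API, and the record layout, tower names, and function names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical call detail records: (caller_id, tower_id, duration_seconds).
CDRS = [
    ("c1", "tower-A", 60),
    ("c2", "tower-B", 30),
    ("c3", "tower-A", 90),
    ("c1", "tower-C", 15),
    ("c2", "tower-A", 45),
]


def map_phase(record):
    """Emit one (key, value) pair per record: tower id -> call duration."""
    _caller, tower, duration = record
    yield tower, duration


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Total call seconds handled by each tower.
    print(run_job(CDRS))
```

In a real cluster the map and reduce functions would run in parallel on many nodes over data stored in HDFS; here everything runs in one process so the data flow is easy to follow.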