Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Sanjeev Kumar



  1. 1. Informatica & Big Data. Sanjeev Kumar, VP & MD, Informatica India. Apache Hadoop India Summit 2011.
  2. 2. Agenda: Big Data; Big Data in Enterprise; Informatica & Data; Informatica & Big Data.
  3. 3. Why “Big Data” Now? Exploding Data Volumes. Complex, Unstructured; Relational. 2,500 exabytes of new information in 2012, with the Internet as the primary driver.
  4. 4. Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year. Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
  5. 5. Why Now? Exploding Data Volumes. Explosion in user-generated content (e.g. blogs, Twitter, Facebook); proliferation of web-connected devices (smartphone interactions with the web); increased consumption of digital content (Netflix, Hulu, Pandora, etc.); Internet of things (smart grids and smart meters; machine-generated data via the web).
  6. 6. Why Now? New Apps/Use-cases. Analyze customer/market sentiment: text analytics on social media and blogs. Achieve operational efficiency: e.g. analyze call detail records (CDRs) to optimize cell tower placements. Make recommendations: data mining on click-stream and purchase history. Predict the future: e.g. Flightcast predicts flight delays.
  7. 7. Big Data Challenges. Storage: cost-effective scalability to multi-terabytes and petabytes; non-traditional data models (complex, semi-structured data). Processing: data mining and collaborative filtering for structured data; text analytics, classification, etc. for unstructured data. Regulatory compliance: data privacy / masking; data archival.
  8. 8. Addressing Big Data Challenges. Storage: parallel databases (Greenplum (EMC), Vertica, AsterData); distributed key/value stores (HBase, Google’s BigTable, Amazon’s SimpleDB); distributed file systems (HDFS, GFS, ParAccel). Analytics: SQL with extensions; MapReduce; dataflow languages (Pig, Sawzall, etc.).
  9. 9. Hadoop Technology Stack: Pig, Hive, Cascading, ZooKeeper, Map/Reduce, HBase, HDFS.
  10. 10. Hadoop Momentum. Charts: Job Trends from; Search Volume Index; News Reference Volume.
  11. 11. Big Data in the Enterprise – Hadoop Usage
  12. 12. Big Data in the Enterprise. Case Studies: Hadoop World 2009. Yahoo!: Social Graph Analysis; VISA: Large Scale Transaction Analysis; China Mobile: Data Mining Platform for Telecom Industry; JP Morgan Chase: Data Processing for Financial Services; eHarmony: Matchmaking in the Hadoop Cloud; Rackspace: Cross Data Center Log Processing; Visible Technologies: Real-Time Business Intelligence; Booz Allen Hamilton: Protein Alignment using Hadoop. Slides and Videos at
  13. 13. Big Data in the Enterprise. Case Studies: Hadoop World 2010. eBay: Hadoop at eBay; Twitter: The Hadoop Ecosystem at Twitter; General Electric: Sentiment Analysis powered by Hadoop; Yale University: MapReduce and Parallel Database Systems; AOL: AOL’s Data Layer; Facebook: HBase in Production; Bank of America: The Business of Big Data; StumbleUpon: Mixing Real-Time and Batch Processing; Raytheon: SHARD: Storing and Querying Large-Scale Data. More info at -
  14. 14. Agenda: Big Data; Big Data in Enterprise; Informatica & Data; Informatica & Big Data.
  15. 15. Informatica – Our Singular Mission: Enabling the Information Economy. We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives.
  16. 16. Informatica – What We Do: Comprehensive, Unified, Open and Economical platform. Diagram labels: Application; Partner Data (SWIFT, NACHA, HIPAA, …); Cloud Computing; Unstructured; Database; Complex Event Processing; Data Warehouse; Data Migration; Test Data Management & Archiving; Master Data Management; Data Synchronization; B2B Data Exchange; Data Consolidation; UltraMessaging.
  17. 17. Informatica & Data. Verbs on Data – We do things to data! INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation | Conversion | De-duping | Exchange | Extraction | Federation | Hub | Identity | Integration | Life-cycle Management | Loading | Masking | Mastering | Matching | Migration | On Demand | Privacy | Profiling | Provisioning | Quality | Quality Assessment | Registry | Replication | Retirement | Services | Stewardship | Sub-setting | Synchronization | Test Management | Transformation | Validation | Virtualization | Warehousing ]
  18. 18. Informatica & Big Data. HDFS as a source and a target: enable universal data connectivity for Hadoop developers. Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic. Lower the barrier to Hadoop entry by using Informatica Developer as a development tool. Support virtualized access to data split across HDFS and (relational) data warehouses.
  19. 19. Informatica & Hadoop – Big Picture. Diagram labels: Enterprise Connectivity for Hadoop programs; Weblogs; Databases; BI; DW/DM; Metadata Repository; Graphical IDE for Hadoop Development; Semi-structured; Un-structured; Enterprise Applications; Transformation Engine for custom data processing; Hadoop Cluster (HDFS, Job Tracker, Name Node, Data Node).
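
Slide 8 lists Map Reduce among the analytics options and slide 9 places Map/Reduce in the Hadoop stack, but the transcript does not show what such a job looks like. The following is a minimal single-process sketch of the map, shuffle, and reduce steps in Python. It is not taken from the talk: word count is only a stand-in workload, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. A real Hadoop job would run these phases in parallel across the cluster and read its input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a cluster job would read splits of a large file.
    sample = ["big data in the enterprise", "big data challenges"]
    print(word_count(sample))
```

The map and reduce functions never share state, which is what allows a framework to run many copies of each on different nodes; only the shuffle step needs to move data between them.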