Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Sanjeev Kumar



  1. Informatica & Big Data
     Sanjeev Kumar
     VP & MD, Informatica India
     Apache Hadoop India Summit 2011
  2. Agenda
     • Big Data
     • Big Data in Enterprise
     • Informatica & Data
     • Informatica & Big Data
  3. Why “Big Data” Now? Exploding Data Volumes
     [Chart labels: Complex, Unstructured; Relational]
     • 2,500 exabytes of new information in 2012, with the Internet as the primary driver
  4. The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
     Source: IDC white paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands", May 2009.
  5. Why Now? Exploding Data Volumes
     • Explosion in user-generated content
       • e.g. blogs, Twitter, Facebook, etc.
     • Proliferation of web-connected devices
       • Smartphone interactions with the web
     • Increased consumption of digital content
       • Netflix, HULU, Pandora, etc.
     • Internet of things
       • Smart grids and smart meters
       • Machine-generated data via the web
  6. Why Now? New Apps/Use-cases
     • Analyze customer/market sentiment
       • Text analytics on social media and blogs
     • Achieve operational efficiency
       • e.g. analyze CDRs (call detail records) to optimize cell tower placement
     • Make recommendations
       • Data mining on click-stream and purchase history
     • Predict the future
       • e.g. Flightcast predicts flight delays
  7. Big Data Challenges
     • Storage
       • Cost-effective scalability to multi-terabytes and petabytes
       • Non-traditional data models: complex, semi-structured data
     • Processing
       • Data mining, collaborative filtering for structured data
       • Text analytics, classification, etc. for unstructured data
     • Regulatory compliance
       • Data privacy / masking
       • Data archival
  8. Addressing Big Data Challenges
     • Storage
       • Parallel databases: Greenplum (EMC), Vertica, AsterData
       • Distributed key/value stores: HBase, Google’s BigTable, Amazon’s SimpleDB
       • Distributed file systems: HDFS, GFS, ParAccel
     • Analytics
       • SQL with extensions
       • MapReduce
       • Dataflow languages: Pig, Sawzall, etc.
  9. Hadoop Technology Stack
     [Diagram labels: Pig; Hive; Cascading; ZooKeeper; Map/Reduce; HBase; HDFS]
  10. Hadoop Momentum
      [Charts: Job Trends; Search Volume Index; News Reference Volume]
  11. Big Data in the Enterprise – Hadoop Usage
  12. Big Data in the Enterprise
      Case Studies: Hadoop World 2009
      • Yahoo!: Social Graph Analysis
      • VISA: Large Scale Transaction Analysis
      • China Mobile: Data Mining Platform for Telecom Industry
      • JP Morgan Chase: Data Processing for Financial Services
      • eHarmony: Matchmaking in the Hadoop Cloud
      • Rackspace: Cross Data Center Log Processing
      • Visible Technologies: Real-Time Business Intelligence
      • Booz Allen Hamilton: Protein Alignment using Hadoop
      Slides and Videos at
  13. Big Data in the Enterprise
      Case Studies: Hadoop World 2010
      • eBay: Hadoop at eBay
      • Twitter: The Hadoop Ecosystem at Twitter
      • General Electric: Sentiment Analysis powered by Hadoop
      • Yale University: MapReduce and Parallel Database Systems
      • AOL: AOL’s Data Layer
      • Facebook: HBase in Production
      • Bank of America: The Business of Big Data
      • StumbleUpon: Mixing Real-Time and Batch Processing
      • Raytheon: SHARD: Storing and Querying Large-Scale Data
      More info at -
  14. Agenda
      • Big Data
      • Big Data in Enterprise
      • Informatica & Data
      • Informatica & Big Data
  15. Informatica – Our Singular Mission: Enabling the Information Economy
      We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives
  16. Informatica – What We Do
      Comprehensive, Unified, Open and Economical platform
      [Diagram labels: Application; Partner Data; SWIFT; NACHA; HIPAA; …; Cloud Computing; Unstructured; Database; Complex Event Processing; Data Warehouse; Data Migration; Test Data Management & Archiving; Master Data Management; Data Synchronization; B2B Data Exchange; Data Consolidation; UltraMessaging]
  17. Informatica & Data
      Verbs on Data – We do things to data!
      INFA = Data + [
      Archival | As a Service | Cleansing | Clustering | Consolidation |
      Conversion | De-duping | Exchange | Extraction | Federation |
      Hub | Identity | Integration | Life-cycle Management |
      Loading | Masking | Mastering | Matching | Migration | On Demand |
      Privacy | Profiling | Provisioning | Quality | Quality Assessment |
      Registry | Replication | Retirement | Services | Stewardship |
      Sub-setting | Synchronization | Test Management | Transformation |
      Validation | Virtualization | Warehousing
      ]
  18. Informatica & Big Data
      • HDFS as a source and a target: enable universal data connectivity for Hadoop developers
      • Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic
      • Lower the barrier to Hadoop entry by using Informatica Developer as a development tool
      • Support virtualized access to data split across HDFS and (relational) data warehouses
  19. Informatica & Hadoop – Big Picture
      [Diagram labels: Enterprise Connectivity for Hadoop programs; Weblogs; Databases; BI; DW/DM; Metadata Repository; Graphical IDE for Hadoop Development; Semi-structured; Un-structured; Enterprise Applications; Transformation Engine for custom data processing; Hadoop Cluster (HDFS, Job Tracker, Name Node, Data Node)]
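
The deck lists MapReduce under Analytics on the "Addressing Big Data Challenges" slide and cites CDR analysis for cell tower placement as a use case, but it shows no code. As a rough illustration only, the map, shuffle, and reduce phases can be sketched in plain Python on a single machine. This is not Hadoop or Informatica code: the record layout, the tower names, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are all hypothetical and chosen for this example.

```python
from collections import defaultdict

# Hypothetical call detail records (CDRs): (tower_id, call_duration_seconds).
# The layout and values are invented for illustration.
CDRS = [
    ("tower-A", 120),
    ("tower-B", 45),
    ("tower-A", 300),
    ("tower-C", 60),
    ("tower-B", 15),
]


def map_phase(record):
    """Map step: emit one (key, value) pair per call, keyed by tower."""
    tower_id, duration = record
    yield tower_id, (1, duration)


def shuffle(pairs):
    """Group values by key, which a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: total the call count and call seconds for one tower."""
    calls = sum(count for count, _ in values)
    seconds = sum(duration for _, duration in values)
    return key, {"calls": calls, "total_seconds": seconds}


def run_job(records):
    """Run map, shuffle, and reduce in-process over an iterable of records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    for tower, stats in sorted(run_job(CDRS).items()):
        print(tower, stats)
```

On a real cluster the map and reduce functions would run in parallel across many nodes with input read from HDFS; the sketch only shows how per-tower load figures fall out of the key/value grouping.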