  • Map/Reduce implementation. Apache open source project, Yahoo-dominated. Two major components: HDFS (failure-resilient distributed file system) and Map/Reduce (failure-resilient distributed computing framework). Scales to thousand+ node clusters. Used by Yahoo, Facebook, etc.

Transcript

  • 1. Informatica & Big Data
    Sanjeev Kumar
    VP & MD, Informatica India
    Apache Hadoop India Summit 2011
  • 2. Agenda
    Big Data
    Big Data in Enterprise
    Informatica & Data
    Informatica & Big Data
  • 3. Why “Big Data” Now? : Exploding Data Volumes
    Complex, Unstructured
    Relational
    • 2,500 exabytes of new information in 2012 with Internet as primary driver
  • 4. Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
    Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
  • 5. Why Now? Exploding Data Volumes
    Explosion in user-generated content
    e.g. Blogs, Twitter, Facebook etc.
    Proliferation of web-connected devices
    Smartphone interactions with the web
    Increased consumption of digital content
    Netflix, HULU, Pandora etc.
    Internet of things
    Smart-grid and smart-meters
    Machine-generated data via the web
  • 6. Why Now? : New Apps/Use-cases
    Analyze customer/market sentiment
    Text analytics on Social Media, blogs
    Achieve Operational Efficiency
    e.g. Analyze CDRs to optimize cell tower placements
    Make Recommendations
    Data mining on click-stream, purchase history
    Predict the future
    e.g. Flightcast predicts flight delays
  • 7. Big Data Challenges
    Storage
    Cost-effective Scalability: to multi-terabytes and petabytes
    Non-traditional data models: complex, semi-structured data
    Processing
    Data mining, collaborative filtering for structured data
    Text Analytics, classification etc. for unstructured data
    Regulatory Compliance
    Data Privacy / Masking
    Data Archival
  • 8. Addressing Big Data Challenges
    Storage
    Parallel Databases
    Greenplum(EMC), Vertica, AsterData
    Distributed Key/Value Stores
    HBase, Google’s BigTable, Amazon’s SimpleDB
    Distributed File Systems
    HDFS, GFS, ParAccel
    Analytics
    SQL with extensions
    Map Reduce
    Dataflow languages: Pig, Sawzall, etc.
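
As a rough illustration of the Map Reduce style of analytics listed above, here is a minimal in-memory sketch in plain Python. It does not use Hadoop; the record layout, the tower names, and the helper `run_map_reduce` are hypothetical and exist only to show the map, shuffle, and reduce phases on a handful of call detail records (CDRs).

```python
from collections import defaultdict

# Hypothetical call detail records: (caller_id, cell_tower_id, duration_seconds).
CDRS = [
    ("A", "tower-1", 60),
    ("B", "tower-2", 30),
    ("C", "tower-1", 120),
    ("A", "tower-3", 45),
    ("D", "tower-2", 15),
]


def map_cdr(record):
    """Map phase: emit one (tower, duration) pair per record."""
    _caller, tower, duration = record
    yield tower, duration


def reduce_durations(tower, durations):
    """Reduce phase: total call seconds observed at one tower."""
    return tower, sum(durations)


def run_map_reduce(records, mapper, reducer):
    """Run mapper over all records, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values under their key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


totals = run_map_reduce(CDRS, map_cdr, reduce_durations)
print(totals)
```

In a real cluster the map and reduce functions would run in parallel on different nodes over data stored in a distributed file system; this single-process version only shows the shape of the computation.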
  • 9. Hadoop Technology Stack
    Pig
    Hive
    Cascading
    ZooKeeper
    Map/Reduce
    HBase
    HDFS
  • 10. Hadoop Momentum
    Job Trends from Indeed.com
    Search Volume Index
    News Reference Volume
  • 11. Big Data in the Enterprise – Hadoop Usage
  • 12. Big Data in the Enterprise – Case Studies: Hadoop World 2009
    Yahoo!: Social Graph Analysis
    VISA: Large Scale Transaction Analysis
    China Mobile: Data Mining Platform for Telecom Industry
    JP Morgan Chase: Data Processing for Financial Services
    eHarmony: Matchmaking in the Hadoop Cloud
    Rackspace: Cross Data Center Log Processing
    Visible Technologies: Real-Time Business Intelligence
    Booz Allen Hamilton: Protein Alignment using Hadoop
    Slides and Videos at http://www.cloudera.com/hadoop-world-nyc
  • 13. Big Data in the Enterprise – Case Studies: Hadoop World 2010
    eBay: Hadoop at eBay
    Twitter: The Hadoop Ecosystem at Twitter
    General Electric: Sentiment Analysis powered by Hadoop
    Yale University: MapReduce and Parallel Database Systems
    AOL: AOL’s Data Layer
    Facebook: HBase in Production
    Bank of America: The Business of Big Data
    StumbleUpon: Mixing Real-Time and Batch Processing
    Raytheon: SHARD: Storing and Querying Large-Scale Data
    More info at - http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • 14. Agenda
    Big Data
    Big Data in Enterprise
    Informatica & Data
    Informatica & Big Data
  • 15. Informatica – Our Singular Mission Enabling The Information Economy
    We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives
  • 16. Informatica – What We Do: Comprehensive, Unified, Open and Economical platform
    Application
    Partner Data
    SWIFT
    NACHA
    HIPAA
    Cloud Computing
    Unstructured
    Database
    Complex Event Processing
    Data Warehouse
    Data Migration
    Test Data Management & Archiving
    Master Data Management
    Data Synchronization
    B2B Data Exchange
    Data Consolidation
    UltraMessaging
  • 17. Informatica & Data
    Verbs on Data – We do things to data!
    INFA = Data + [
    Archival | As a Service | Cleansing | Clustering | Consolidation |
    Conversion | De-duping | Exchange | Extraction | Federation |
    Hub | Identity | Integration | Life-cycle Management |
    Loading | Masking | Mastering | Matching | Migration | On Demand |
    Privacy | Profiling | Provisioning | Quality | Quality Assessment |
    Registry | Replication | Retirement | Services | Stewardship |
    Sub-setting | Synchronization | Test Management | Transformation |
    Validation | Virtualization | Warehousing
    ]
  • 18. Informatica & Big Data
    HDFS as a source and a target - Enable universal data connectivity for Hadoop developers
    Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic
    Lower the barrier to Hadoop-entry by using Informatica Developer as a development tool
    Support virtualized access to data split across HDFS and (relational) data-warehouses
  • 19. Informatica & Hadoop – Big Picture
    Enterprise Connectivity for Hadoop programs
    Weblogs
    Databases
    BI
    DW/DM
    Metadata Repository
    Graphical IDE for Hadoop Development
    Semi-structured
    Un-structured
    Enterprise Applications
    Transformation Engine for custom data processing
    Hadoop Cluster: HDFS, Job Tracker, Name Node, Data Node