Big Data in 10
What’s real and what’s fluff
   Abhishek Pamecha
        Mar-2013
What is Big Data
• It is all about data
   – But not about “how much”
   – But about correlations and increased reach
BigData Architecture
It influences or changes your
   • Data source choices
   • Data storing choices
   • Data analyzing/mining approaches



It helps
   • Address highly focused use cases
   • Correlate more data sources
   • Address scale and fault tolerance issues
Caution!



BigData is not a “substitute” for existing warehousing practices.
It complements existing practices.
Architectures – Data sources
• Traditional DW
   – Production DB
   – Dictionaries
   – ETL/ELT pipelines
   – External data marts

• BigData adds
   – Log files
   – Social graphs
   – Streaming data
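To make the idea of correlating a new source with a traditional one concrete, here is a minimal sketch that joins web log lines against a lookup standing in for a production DB table. The log format, the field names, and the `users` mapping are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical stand-in for a production DB table: user_id -> country.
users = {101: "US", 102: "IN"}

# Hypothetical log lines in an assumed "timestamp key=value key=value" format.
log_lines = [
    "2013-03-01T10:00:00 user=101 path=/home",
    "2013-03-01T10:00:05 user=102 path=/cart",
    "2013-03-01T10:00:09 user=101 path=/cart",
    "malformed line",
]

def parse(line):
    """Return (user_id, path) for a well-formed line, else None."""
    parts = line.split()
    if len(parts) != 3:
        return None
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    user = fields.get("user", "")
    if not user.isdigit() or "path" not in fields:
        return None
    return int(user), fields["path"]

# Correlate: count page hits per country, skipping unparseable lines
# and users that are missing from the lookup.
hits_by_country = Counter()
for line in log_lines:
    parsed = parse(line)
    if parsed is None:
        continue
    user_id, _path = parsed
    country = users.get(user_id)
    if country is not None:
        hits_by_country[country] += 1
```

Log data is often messy, so the sketch drops lines it cannot parse rather than failing; a real pipeline would likely count and report them.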
Architectures – Data Storage
• Traditional DW
   – Production DB
      • Flattened hierarchies
      • Resolved references
   – ROLAP or MOLAP databases
      • Star schema
      • Materialized views
      • Virtual data marts
      • Partitioned tables
   – Still relational

• BigData adds
   – Distributed file storage
   – Distributed hash maps
   – Columnar representations
   – Graph databases
   – Document collections
   – Other NoSQL variants
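As an illustration of what a columnar representation changes, the sketch below pivots row-oriented records into one list per column. The sample records and the `to_columns` helper are hypothetical; real columnar stores add compression, encoding, and on-disk layout that this toy version ignores.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
rows = [
    {"user": "a", "country": "US", "spend": 10.0},
    {"user": "b", "country": "IN", "spend": 4.5},
    {"user": "c", "country": "US", "spend": 7.5},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)

columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# instead of scanning every field of every row.
total_spend = sum(columns["spend"])
```

The trade-off is that reassembling a full row means reading from every column, which is why this layout tends to suit analytic scans more than single-record lookups.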
Architectures – Analytic approaches
• Traditional DW
   – Production DB
      • Flattened hierarchies
      • Resolved references
         – Pre-generate results
   – ROLAP databases
      • Star schema
         – Multidimensional queries
      • Materialized views
         – Ad hoc explorations on subsets
      • Still relational
      • Virtual data marts
         – Ad hoc explorations on subsets
      • Partitioned tables

• BigData adds
   – Distributed file storage
      • MapReduce frameworks and chaining
   – Distributed hash maps
      • Single key predominant
   – Columnar representations
      • Extract selected columns per row
   – Graph databases
      • Navigate links
   – Document collections
      • Simplified schemas
   – Other NoSQL approaches
      • Stream pattern matching and pipelining
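The map reduce chaining mentioned above can be sketched in a few lines. This is an in-memory toy, not a distributed framework: the `map_reduce` helper and the two stages are hypothetical names, and a real framework would spread the map, shuffle, and reduce steps across machines.

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    """Run one map / shuffle / reduce pass in memory."""
    groups = defaultdict(list)
    # Map: each record yields zero or more (key, value) pairs.
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)  # Shuffle: group values by key.
    # Reduce: collapse each key's values into one output record.
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Stage 1: count occurrences of each word.
def split_words(line):
    return [(word.lower(), 1) for word in line.split()]

def sum_counts(word, ones):
    return (word, sum(ones))

# Stage 2: group words by how often they occurred.
def by_count(pair):
    word, count = pair
    return [(count, word)]

def collect_words(count, words):
    return (count, sorted(words))

lines = ["big data", "Big reach", "data data"]

# Chaining: the output of stage 1 is the input of stage 2.
word_counts = map_reduce(lines, split_words, sum_counts)
histogram = map_reduce(word_counts, by_count, collect_words)
```

Each stage only needs the previous stage's output, which is what lets such jobs be chained and rerun independently after a failure.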
Big Data Architectures – Pros and Cons
• Pros
   – Incorporate low-value and social data in analysis
   – Extend analysis to unstructured data
   – Correlate across data sources on the same platform
   – Very strong in their sweet spots
   – Efficiency in terms of
      • data movement volume
      • scale
      • fault tolerance
      • responsiveness

• Cons
   – Not relational; gives up some relational advantages
      • Joins
      • Aggregations, etc.
   – Few standards, so solutions are not portable
   – Less support from end-user tools and applications (though growing)
   – Not a replacement for the DW, just an extension of it
   – Not suited to every class of use case; each technology has its sweet spots
   – Heterogeneous setup in development and operations
Challenges
•   Architectural
     –   “Big” data management
     –   Data consistency
     –   Read heavy or write heavy
     –   Scaling
     –   Distributed deployment


•   Functional
     –   Data quality
     –   Problem set choice


•   Organizational
     –   Data backed decisions
     –   Going overboard
     –   SLAs and operations management
     –   Data Privacy
Thank you!
