Big Data
Shankar Radhakrishnan
Topics

• Data Management Today


• New Interests, Expectations, Problems


• Big Data


• New Approach


• Big Data Ecosystem


• Q&A
Data Management Today

• Relational Databases


  • Oracle, MySQL, MS-SQL Server


• Data Warehouse Appliances


  • Teradata, IBM Netezza


• Legacy Systems


  • Mainframes
New Interests, Expectations

• Collect More, Data-Mine More   • Actionable Insights


• Complex Data Integration       • Extension of Investments


• Advanced Analytics             • Talent Management


• Social Data Analysis           • ROI


• Machine Data Analysis          • TCO


• Real-time Data Analysis        • Business Continuity
How Big is Data?

Big facts (as of Oct 2012):

• 90% of the world’s data was created in the last two years

• $214 is the average amount companies have to spend per compromised customer when a data breach occurs

• 2.7bn is the average number of “likes” and “comments” posted on Facebook daily

• 247bn e-mail messages are sent each day; about 80% of them are spam

• 500,000+ data centers across the world are large enough to fill 5,955 football fields

• It would take 2,000 hours to watch all the YouTube videos uploaded while we’re talking on this panel*

*This is 3x more than just 2 short years ago
New Problems

• Unpredictable Volume          • Computing Limitations


• Data Processing Issues        • Information vs. Insights


• Data Integration Issues       • Business Requirements


• Identifying Source-of-Truth   • Regulatory Requirements


• Store vs. Analyze             • True Value-of-Data


• Data Retrieval Requirements   • Price to Performance Dilemma
What is Big Data?

• Volume
  • Very large data sets
  • Sizes from 100 TB to 50 PB
  • Larger than “one machine”
  • Whole data set analysis replaces “sampling”

• Velocity
  • Real-time, streaming data
  • High volume / Low latency
    • Write heavy
    • Read heavy
    • Both are common

• Variety
  • Structured data
    • OLTP
    • DW
    • ODS
    • Data marts
  • Unstructured data
    • Text
    • Audio
    • Video
    • Click streams
    • Log files

• Complexity
  • Data acquisition
  • Analysis
  • Deriving insights

Source: Ventana Research
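
To make “larger than one machine” concrete, here is a rough back-of-the-envelope sketch of how long one full pass over 100 TB would take. The 100 MB/s sequential read rate per machine and the 1,000-machine cluster are assumptions chosen for illustration, not figures from the slides, and the model ignores network and coordination overhead.

```python
# Back-of-the-envelope scan time for one full pass over a data set.
# ASSUMED_THROUGHPUT and the machine counts are illustrative assumptions.

TB = 10**12  # decimal terabyte, in bytes
SECONDS_PER_DAY = 86_400

ASSUMED_THROUGHPUT = 100 * 10**6  # assumed 100 MB/s sequential read per machine


def scan_seconds(size_bytes, throughput_bytes_per_s, machines=1):
    """Seconds needed to read size_bytes once, split evenly across machines."""
    return size_bytes / (throughput_bytes_per_s * machines)


# One machine reading 100 TB end to end.
single_machine_days = scan_seconds(100 * TB, ASSUMED_THROUGHPUT) / SECONDS_PER_DAY

# The same 100 TB spread evenly over a hypothetical 1,000-machine cluster.
cluster_minutes = scan_seconds(100 * TB, ASSUMED_THROUGHPUT, machines=1000) / 60

print(f"single machine: {single_machine_days:.1f} days")
print(f"1,000 machines: {cluster_minutes:.1f} minutes")
```

Under these assumptions a single machine needs more than a week for one pass, which is why whole data set analysis at this scale tends to imply a cluster rather than a bigger server.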
New Approach

• Commodity Hardware


  • Open Compute Project


• Open Source Solutions, Frameworks


  • Value Added Products – Cloudera, DataStax, 10gen


• Research Oriented Product Development


• Augmented Ecosystem
Big Data: Ecosystem

• Advanced Analytics (Data Analytics)
  • Predictive & Optimization Modeling, Business Processes Analysis, Functional Analysis
  • Tools: R, Splunk, MADlib, Mahout, SAS Big Data Visual Analytics

• Advanced Visualizations (Data Delivery, Data Visualization)
  • Data Delivery: Dashboards, Scorecard (Strategy Maps), Spatial & Temporal Analysis
  • Tools: Tableau, Spotfire, Datameer

• BI / Reporting (Data Engineering, Data Agility)
  • Data Engineering: Performance Reporting, Enterprise Metrics
  • Data Agility: Data Mining, OLAP Modeling, etc.
  • Tools: Pig, Hive, Lucene, Karmasphere, other BI tools with Hadoop connectors

• Data Storage and Processing (Data Consolidation, Data Economics)
  • Data Storage, Data Processing
  • Tools: Cassandra, Crunch, Pangool, HDFS, HBase, MapReduce

• Data Integration & Management (Integration)
  • Data Filtering, Data Consolidation & Warehousing, Data Quality, Metadata Management, Job Scheduling, Data Economics
  • Tools: Flume, Scribe, Avro, Sqoop, Chukwa, ZooKeeper, Oozie, native Hadoop ETL, traditional ETL with Hadoop connectors

• Distributed Infrastructure

Legend: Hadoop components, open source Hadoop platforms, 3rd party Hadoop supporting platforms
What can Big Data do that traditional data warehousing and analytics cannot?

Traditional DW                                      Big Data

Complete records from known transactional           Data from many different internal & external
systems.                                            sources with unknown quality and/or utility.

Data is structured, and data fields have known      Loosely structured data. Flat schemas with few
(and often complex) interrelationships.             complex interrelationships; connections between
                                                    data elements have to be probabilistically inferred.

Multiple terabytes of data                          Multiple petabytes of data

Mostly scale-up architecture                        Scale-out architecture

Analytics run on a stable data model.               The analytic models are larger and require very
                                                    large amounts of hardware resources to process
                                                    them in a timely manner.

Low performance/cost ratio, as most of the          High performance/cost ratio, as most of the
software/hardware platforms are proprietary         software/hardware platforms are commodity, free,
and license-based.                                  and open source.
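
The scale-out row can be illustrated with a minimal, single-process sketch of the map/reduce pattern. This is not Hadoop's API: the function names and the sample partitions are hypothetical, and on a real cluster each partition would be processed by a different machine.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition (one worker's share of the data)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(partitions):
    """Run the map step per partition, then fold the partial results together."""
    # On a cluster the map calls would run in parallel; here they run in a loop.
    partials = [map_partition(partition) for partition in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical data, already split into two partitions.
partitions = [
    ["big data big models"],
    ["data from many sources", "big data"],
]
totals = word_count(partitions)
print(totals.most_common(2))
```

Adding capacity in this style means adding partitions and workers, rather than replacing one machine with a larger one.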
What can Big Data do that traditional data warehousing and analytics cannot? (continued)

Traditional DW                                      Big Data

Aggregate data (structured)                         Raw data (structured and unstructured)

Aggregate / Segment analytics                       Individual level analytics, micro segmentation,
                                                    individualized offers to customers

Mainstream analytics                                Outlier analytics, pattern discovery, simulation
  - Structured analysis                             and modeling, machine learning
  - OLAP cubes

Sample data is used for identifying patterns        Entire population of granular data can be
                                                    leveraged

Reports & Dashboards are done on a production       Real-time operational analytics and reporting.
basis                                               Intra-day decision making.

Traditional models are good for small amounts       Big Models: computationally intensive analyses,
of data due to time constraints                     simulations, models with many parameters
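
As a small illustration of the aggregate / segment versus individual level contrast, the sketch below computes both views from the same raw rows. The transactions, customer IDs, and segment names are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative rows: (customer_id, segment, amount).
TRANSACTIONS = [
    ("c1", "retail", 20.0),
    ("c1", "retail", 30.0),
    ("c2", "retail", 25.0),
    ("c3", "wholesale", 900.0),
    ("c3", "wholesale", 1100.0),
]


def segment_averages(rows):
    """Aggregate view: average transaction amount per segment."""
    by_segment = defaultdict(list)
    for _customer, segment, amount in rows:
        by_segment[segment].append(amount)
    return {segment: mean(amounts) for segment, amounts in by_segment.items()}


def customer_totals(rows):
    """Individual view: total spend per customer, kept at full granularity."""
    totals = defaultdict(float)
    for customer, _segment, amount in rows:
        totals[customer] += amount
    return dict(totals)


segments = segment_averages(TRANSACTIONS)
customers = customer_totals(TRANSACTIONS)
print(segments)
print(customers)
```

The segment view hides that c1 spends twice as much as c2 within the same segment; the per-customer view keeps that detail, which is what individualized offers depend on.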
Q&A




      Thank You!

Kurukshetra - Big Data
