Using distributed technologies
to analyze Big Data

                    Abhijit Sharma
                    Innovation Lab
                    BMC Software




Data Explosion in Data Center
• Performance / Time Series Data
    § Incoming data rate: ~millions of data points / min
    § Data generated per server per year: ~2 GB
    § 50 K servers: ~100 TB of data / year
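
The yearly volume follows directly from the two figures on the slide. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 1,000 GB); the function name is illustrative only:

```python
# Back-of-the-envelope check of the slide's figures.
GB_PER_SERVER_PER_YEAR = 2   # from the slide
SERVERS = 50_000             # from the slide
GB_PER_TB = 1_000            # assumption: decimal units


def yearly_volume_tb(servers, gb_per_server):
    """Total data generated per year, in TB."""
    return servers * gb_per_server / GB_PER_TB


print(yearly_volume_tb(SERVERS, GB_PER_SERVER_PER_YEAR))  # 100.0
```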




Online Warehouse - Time Series
   § Extreme storage requirements – TS data for a data center, e.g. for the last year
   § Online TS data availability, i.e. no separate ETL
   § Support for common analytics operations
           § Roll-up data, e.g. CPU/min to CPU/hour, CPU/day etc.
           § Slice and dice – CPU util. for UNIX servers in the SFO data center last week
           § Statistical operations: sum, count, avg., var, std., moving avg., frequency distributions, forecasting etc.
   § Ease of use – SQL interface, design schema for TS data
   § Horizontal scaling - lower cost commodity hardware
   § High R/W volume
   [Figure: data cube of CPU over the dimensions OS, Time and Data Center]
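
The roll-up operation listed above (CPU/min to CPU/hour) can be illustrated with a small sketch. It is not taken from the deck: the sample data, the function name `roll_up` and the choice of averaging as the roll-up function are all assumptions made for illustration.

```python
from collections import defaultdict


def roll_up(samples, bucket_seconds):
    """Average (epoch_seconds, value) samples into fixed-size time buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        bucket = ts - ts % bucket_seconds  # start of the bucket holding this sample
        sums[bucket] += value
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical per-minute CPU samples spanning two hours.
per_minute = [(0, 10.0), (60, 20.0), (3600, 30.0), (3660, 50.0)]
print(roll_up(per_minute, 3600))  # {0: 15.0, 3600: 40.0}
```

Calling the same function with `bucket_seconds=86400` would give the CPU/day roll-up.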




Why not use RDBMS based Data Warehousing?
Star schema – dimensions & facts
      § Offline data availability – ETL required – not online
      § Expensive to scale vertically – high-end hardware & software
      § Limits to vertical scaling – big data may not fit
      § Features like transactions etc. are unnecessary and an overhead for certain applications
      § Large scale distribution/partitioning is painful – suboptimal at high W/R ratios
      § Flexible schema support, i.e. a schema that can be changed on the fly, is not possible

Page 4 | 6/5/11
High Level Architecture


  Real time continuous load of Metric & Dimension Data        Schema & Query

                         Hive – Distributed SQL

            NoSQL Column Store - HBase

            Hadoop HDFS & Map Reduce Framework

                          Map Reduce & HDFS Nodes
Map Reduce - Recap
Map Function
    § Applied to input data; emits a reduction key and value
    § Output of Map is sorted and partitioned for use by Reducers
Reduce Function
    § Applied to data grouped by reduction key
    § Often 'reduces' data (for example – sum(values))
Mappers and Reducers can be chained together
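
The map, sort and reduce steps in the recap can be imitated in a few lines of single-process Python. This is only a conceptual sketch, not Hadoop code: the record layout (server, cpu) and all function names are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(record):
    """Map: emit (reduction key, value) pairs for one input record."""
    server, cpu = record
    yield server, cpu


def reduce_fn(key, values):
    """Reduce: collapse all values that share a key (here: sum)."""
    return key, sum(values)


def run_map_reduce(records):
    # Map phase: apply map_fn to every input record.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle/sort phase: bring equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


print(run_map_reduce([("host1", 20), ("host2", 10), ("host1", 5)]))
# [('host1', 25), ('host2', 10)]
```

Chaining, as mentioned on the slide, would amount to feeding the output list of one `run_map_reduce` call into another map and reduce pair.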
HDFS Sweet spot
      § Big Data storage: optimized for large files (ETL)
      § Writes are create, append, and large
      § Reads are mostly big and streaming
      § Throughput is more important than latency
      § Distributed, HA, transparent replication
When is raw HDFS unsuitable?
• Mutable data – Create, Update, Delete
• Small writes
• Random reads, high % of small reads
• Structured data
• Online access to data – HDFS loading is an offline / batch process


NoSQL Data stores - Column
         § Excellent concurrent W/R performance – fast writes and fast reads (random and sequential) – this is required for near real time updates to TS data
         § Distributed architecture, horizontal scaling, transparent replication of data
         § Highly Available (HA) and Fault Tolerant (FT), no SPOF – shared nothing architecture
         § Reasonably rich data model
         § Flexible in terms of schema – amenable to ad-hoc changes even at runtime
HBase
         § A (Table, Row, Column Family:Column, Timestamp) tuple maps to a stored value
         § A table is split into multiple equal sized regions, each of which is a range of sorted keys (partitioned automatically by the key)
         § Rows ordered by key, columns ordered within a Column Family
         § Table schema defines Column Families
         § Rows can have different numbers of columns
         § Columns have a value and versions (any number)
         § Column range and key range queries

          Row Key        Column Family (dimensions)       Column Family (metric)
          112334-7782    server:host1   dc:PUNE           value:20
          112334-7783    server:host2                     value:10
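
The data model above (a tuple mapping to a versioned value, with rows kept in key order) can be pictured with a toy in-memory structure. This sketch is not the HBase API; the class `TinyTable`, its methods and the column names `metric:value` and `dimensions:server` are hypothetical and exist only to illustrate the layout.

```python
from bisect import bisect_left


class TinyTable:
    """Toy in-memory model of the layout on the slide; not the HBase API."""

    def __init__(self):
        self.rows = {}         # row key -> {"family:column": {timestamp: value}}
        self.sorted_keys = []  # row keys kept in sorted order

    def put(self, row, column, timestamp, value):
        if row not in self.rows:
            self.rows[row] = {}
            self.sorted_keys.insert(bisect_left(self.sorted_keys, row), row)
        self.rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell."""
        versions = self.rows[row][column]
        return versions[max(versions)]

    def scan(self, start, stop):
        """Key range query: row keys with start <= key < stop, in key order."""
        lo = bisect_left(self.sorted_keys, start)
        hi = bisect_left(self.sorted_keys, stop)
        return self.sorted_keys[lo:hi]


t = TinyTable()
t.put("112334-7783", "metric:value", 1, 10)
t.put("112334-7782", "metric:value", 1, 20)
t.put("112334-7782", "dimensions:server", 1, "host1")
t.put("112334-7782", "metric:value", 2, 25)  # newer version of the same cell
print(t.scan("112334-7782", "112334-7784"))  # ['112334-7782', '112334-7783']
print(t.get("112334-7782", "metric:value"))  # 25
```

Because rows are held in key order, a key range query is a contiguous slice. This is why a row key that leads with the lookup dimension keeps related time series rows together.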
Hive – Distributed SQL > MR
       § MR is not easy to code for analytics tasks (e.g. group, aggregate etc.); chaining several Mappers & Reducers is required
       § Hive provides familiar SQL queries which are automatically translated into a flow of appropriate Mappers and Reducers that execute the query using MR
       § Leverages the Hadoop ecosystem - MR, HDFS, HBase
       § Hive defines a schema for the meta-tables it uses to build the schema its SQL queries can use and to store metadata
       § Storage Handlers for HDFS, HBase
       § Hive SQL supports common SQL clauses: select, filter, grouping, aggregation, insert etc.
       § Hive stores data divided into partitions (you can specify the partitioning key while loading Hive tables) and buckets (useful for statistical operations like sampling)
       § Hive queries can also include custom map/reduce tasks as scripts
Hive Queries - CREATE
TABLE

CREATE TABLE wordfreq (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'freq.txt' OVERWRITE INTO TABLE wordfreq;

EXTERNAL TABLE

CREATE EXTERNAL TABLE iops (key string, os string, deploymentsize string, ts int, value int)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:os,data:deploymentSize,data:ts,data:value");
Hive Queries - SELECT
TABLE

select * from wordfreq where freq > 100 sort by freq desc limit 3;

explain select * from wordfreq where freq > 100 sort by freq desc limit 3;

select freq, count(*) AS f2 from wordfreq group by freq sort by f2 desc limit 3;

EXTERNAL TABLE

select ts, avg(value) as cpu from cpu_util_5min group by ts;

select architecture, avg(value) as cpu from cpu_util_5min group by architecture;
Hive – SQL -> Map Reduce
CPU utilization / 5 min with dimensions server, server-type, cluster, data-center; filter by server-type = Unix and group by timestamp

     SELECT timestamp, AVG(value)
     FROM timeseries
     WHERE server_type = 'Unix'
     GROUP BY timestamp

     timeseries -> Map -> Shuffle / Sort -> Reduce
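
One way to picture how this query lines up with the Map, Shuffle / Sort and Reduce stages is the single-process sketch below. It is a conceptual illustration only: the plan Hive actually generates may differ (for example, it may aggregate partially on the map side), and the sample rows and function names are assumptions.

```python
from collections import defaultdict

# Hypothetical rows of the timeseries table: (timestamp, server_type, value)
ROWS = [
    (300, "Unix", 20.0),
    (300, "Windows", 90.0),
    (300, "Unix", 40.0),
    (600, "Unix", 10.0),
]


def map_phase(rows):
    """WHERE clause and key selection: keep Unix rows, emit (timestamp, value)."""
    for ts, server_type, value in rows:
        if server_type == "Unix":
            yield ts, value


def shuffle_sort(pairs):
    """Group values by key and order the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """GROUP BY with AVG: one average per timestamp."""
    return [(ts, sum(values) / len(values)) for ts, values in grouped]


result = reduce_phase(shuffle_sort(map_phase(ROWS)))
print(result)  # [(300, 30.0), (600, 10.0)]
```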
Thanks




Using the cloud and distributed technologies to analyze big data in the enterprise - Indicthreads cloud computing conference 2011
