Using the cloud and distributed technologies to analyze big data in the enterprise - Indicthreads cloud computing conference 2011

Session presented at the 2nd Conference on Cloud Computing held in Pune, India on 3-4 June 2011.

Abstract: “IT systems today are being used to manage, monitor and analyze cloud-scale infrastructures. This involves large-scale collection and analysis of data related to hundreds of performance measures (such as CPU and memory utilization, job queue size, etc.) for hundreds of thousands of servers and applications in a cloud-scale data center, with fairly short sampling intervals on the order of seconds. This scale yields millions of concurrent time series observations and an extremely large volume of data (terabytes).

This data is used in the enterprise for real-time monitoring, predictive analytics, capacity planning, application/virtual machine placement, root cause analysis of events, etc. The sheer volume and size of the time series data stream make it quite challenging to store this massive amount of data and to support prompt analytics using traditional approaches like data warehousing.

With the advent and rising popularity of distributed technologies like Hadoop, HBase and Hive, large-scale analytics on big data is becoming popular in the enterprise as well. These technologies are used by social web sites like Facebook to perform analytics on extremely large-scale data. Hadoop is the underlying platform that provides the HDFS distributed file system and the framework for executing MapReduce programs. HBase is a distributed NoSQL column data store built on HDFS, and Hive provides an SQL layer on top of Hadoop/HBase that supports querying large-scale data in a developer-friendly, SQL-like language.

In this session we introduce these technologies and explore using these non-traditional technologies to solve the problems of big data storage and analytics in the enterprise.”

Speaker: Abhijit Sharma works as an architect/researcher with the Incubator & Innovation lab in BMC Software. He works on emerging technology areas and how they impact the BMC Software product portfolio and the domain in which it operates, with a focus on areas related to cloud and IT. He has a wide range of experience in architecting, designing and implementing enterprise products, having worked in his own startup, in a venture-backed startup, and in a research lab in an established company like BMC Software.

Published in: Technology


  • 1. Using distributed technologies to analyze Big Data. Abhijit Sharma, Innovation Lab, BMC Software
  • 2. Data Explosion in Data Center
      • Performance / Time Series Data
          • Incoming data rates: ~millions of data points/min
          • Data generated/server/year: ~2 GB
          • 50 K servers: ~100 TB data/year
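The yearly total on this slide follows directly from the per-server figure. A quick check of the arithmetic, assuming decimal units (1 TB = 1,000 GB), which the slide does not state:

```python
GB_PER_SERVER_PER_YEAR = 2   # from the slide
SERVERS = 50_000             # from the slide ("50 K servers")
GB_PER_TB = 1_000            # assumption: decimal units, not stated on the slide

total_tb_per_year = GB_PER_SERVER_PER_YEAR * SERVERS / GB_PER_TB
print(f"~{total_tb_per_year:.0f} TB of time series data per year")
```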
  • 3. Online Warehouse - Time Series
      • Extreme storage requirements – TS data for a data center, e.g. last year
      • Online TS data availability, i.e. no separate ETL
      • Support for common analytics operations
          • Roll-up data, e.g. CPU/min to CPU/hour, CPU/day etc.
          • Slice and dice – CPU util. for UNIX servers in SFO data center last week
          • Statistical operations: sum, count, avg., var., std., moving avg., frequency distributions, forecasting etc.
      • Ease of use – SQL interface, design schema for TS data
      • Horizontal scaling – lower cost commodity hardware
      • High R/W volume
      • [Figure: data cube of CPU over the dimensions OS, Time and Data Center]
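The roll-up operation listed on this slide (CPU/min to CPU/hour) can be illustrated with a small in-memory sketch. This is not how the talk's system implements it; the function name `roll_up` and the sample data are hypothetical, and averaging is assumed as the roll-up aggregate.

```python
from collections import defaultdict
from statistics import mean

def roll_up(samples, bucket_seconds):
    """Average (epoch_seconds, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}

# Hypothetical per-minute CPU samples covering two hours:
# 10% utilization during the first hour, 30% during the second.
per_minute = [(minute * 60, 10.0 if minute < 60 else 30.0) for minute in range(120)]

per_hour = roll_up(per_minute, 3600)
print(per_hour)
```

Calling `roll_up(per_minute, 86400)` would give the CPU/day view from the same samples.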
  • 4. Why not use RDBMS based Data Warehousing?
      • Star schema – dimensions & facts
      • Offline data availability – ETL required – not online
      • Expensive to scale vertically – high-end hardware & software
      • Limits to vertical scaling – big data may not fit
      • Features like transactions etc. are unnecessary and an overhead for certain applications
      • Large scale distribution/partitioning is painful – suboptimal on high W/R ratios
      • Flexible schema support which can be changed on the fly is not possible
  • 5. High Level Architecture
      • [Figure: layered architecture, top to bottom]
          • Inputs: real time continuous load of metric & dimension data; schema & query
          • Hive – Distributed SQL
          • NoSQL Column Store – HBase
          • Hadoop HDFS & Map Reduce Framework
          • Map Reduce & HDFS Nodes
  • 6. Map Reduce - Recap
      • Map Function
          • Apply to input data, emits reduction key and value
          • Output of Map is sorted and partitioned for use by Reducers
      • Reduce Function
          • Apply to data grouped by reduction key
          • Often ‘reduces’ data (for example – sum(values))
      • Mappers and Reducers can be chained together
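The map, sort/partition and reduce steps from this recap can be mimicked in a single process. The sketch below is only an illustration of the programming model, not Hadoop's API; `map_reduce`, `word_mapper` and `sum_reducer` are hypothetical names, and the word-count job is chosen to match the sum(values) example on the slide.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    # Map: each input record emits zero or more (key, value) pairs.
    emitted = [pair for record in records for pair in mapper(record)]
    # Sort: order pairs by reduction key so equal keys become adjacent
    # (a real cluster would also partition keys across reducers).
    emitted.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with all of that key's values.
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(emitted, key=itemgetter(0))
    }

def word_mapper(line):
    for word in line.split():
        yield word, 1

def sum_reducer(key, values):
    return sum(values)

counts = map_reduce(["a b a", "b a"], word_mapper, sum_reducer)
print(counts)
```

Chaining, as mentioned on the slide, would amount to feeding the items of one job's output into another `map_reduce` call.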
  • 7. HDFS Sweet spot
      • Big Data Storage: optimized for large files (ETL)
      • Writes are create, append, and large
      • Reads are mostly big and streaming
      • Throughput is more important than latency
      • Distributed, HA, transparent replication
  • 8. When is raw HDFS unsuitable?
      • Mutable data – create, update, delete
      • Small writes
      • Random reads, % of small reads
      • Structured data
      • Online access to data – HDFS loading is an offline / batch process
  • 9. NoSQL Data stores - Column
      • Excellent W/R concurrent performance – fast writes and fast reads (random and sequential) – this is required for near real time update of data to TS data
      • Distributed architecture, horizontal scaling, transparent replication of data
      • Highly Available (HA) and Fault Tolerant (FT) for no SPOF – shared nothing architecture
      • Reasonably rich data model
      • Flexible in terms of schema – amenable to ad-hoc changes even at runtime
  • 10. HBase
      • (Table, Row, Column Family:Column, Timestamp) tuple maps to a stored value
      • Table is split into multiple equal sized regions, each of which is a range of sorted keys (partitioned automatically by the key)
      • Ordered rows by key, ordered columns in a Column Family
      • Table schema defines Column Families
      • Rows can have different number of columns
      • Columns have value and versions (any number)
      • Column range and key range queries
      • Example rows:
          Row Key      | Column Family (dimensions)  | Column Family (metric)
          112334-7782  | server: host1, dc: PUNE     | value: 20
          112334-7783  | server: host2               | value: 10
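The data model on this slide can be pictured as a sorted map from row key to columns to timestamped versions. The toy class below is an in-memory illustration only, not the HBase client API; `ToyTable`, its methods, and the column family names `dims` and `metric` are hypothetical, and the second version written for row 112334-7783 is made up to show versioning.

```python
from bisect import bisect_left

class ToyTable:
    """In-memory illustration of the (row, family:column, timestamp) -> value model."""

    def __init__(self, column_families):
        # The table schema fixes the column families; columns inside them are free-form.
        self.column_families = set(column_families)
        self.cells = {}  # row key -> {"family:column": {timestamp: value}}

    def put(self, row, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.cells.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        # Return the most recent version of the cell.
        versions = self.cells[row][column]
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        # Key range query over sorted row keys; stop_row is exclusive.
        keys = sorted(self.cells)
        return keys[bisect_left(keys, start_row):bisect_left(keys, stop_row)]

table = ToyTable(["dims", "metric"])
table.put("112334-7782", "dims:server", "host1", 1)
table.put("112334-7782", "dims:dc", "PUNE", 1)
table.put("112334-7782", "metric:value", 20, 1)
table.put("112334-7783", "dims:server", "host2", 1)  # this row has no dc column
table.put("112334-7783", "metric:value", 10, 1)
table.put("112334-7783", "metric:value", 12, 2)      # hypothetical newer version

latest = table.get("112334-7783", "metric:value")
rows = table.scan("112334-7782", "112334-7783")
print(latest, rows)
```

Sorting the keys on every scan keeps the sketch short; a real store keeps rows ordered on disk, which is what makes key range queries cheap.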
  • 11. Hive – Distributed SQL > MR
      • MR is not easy to code for analytics tasks (e.g. group, aggregate etc.); chaining several Mappers & Reducers is required
      • Hive provides familiar SQL queries which automatically get translated to a flow of appropriate Mappers and Reducers that execute the query leveraging MR
      • Leverages Hadoop ecosystem – MR, HDFS, HBase
      • Hive defines a schema for the meta-tables it will use to build a schema its SQL queries can use and to store metadata
      • Storage Handlers for HDFS, HBase
      • Hive SQL supports common SQL select, filter, grouping, aggregation, insert etc. clauses
      • Hive stores the data partitioned by partitions (you can specify the partitioning key while loading Hive tables) and buckets (useful for statistical operations like sampling)
      • Hive queries can also include custom map/reduce tasks as scripts
  • 12. Hive Queries - CREATE
      • TABLE
          CREATE TABLE wordfreq (word STRING, freq INT)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE;

          LOAD DATA LOCAL INPATH 'freq.txt' OVERWRITE INTO TABLE wordfreq;
      • EXTERNAL TABLE
          CREATE EXTERNAL TABLE iops (key string, os string, deploymentsize string, ts int, value int)
          STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
          WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:os,data:deploymentSize,data:ts,data:value");
  • 13. Hive Queries - SELECT
      • TABLE
          select * from wordfreq where freq > 100 sort by freq desc limit 3;
          explain select * from wordfreq where freq > 100 sort by freq desc limit 3;
          select freq, count(*) AS f2 from wordfreq group by freq sort by f2 desc limit 3;
      • EXTERNAL TABLE
          select ts, avg(value) as cpu from cpu_util_5min group by ts;
          select architecture, avg(value) as cpu from cpu_util_5min group by architecture;
  • 14. Hive – SQL -> Map Reduce
      • CPU utilization / 5 min with dimensions server, server-type, cluster, data-center; group by server-type and filter by value Unix
      • Query:
          SELECT timestamp, AVG(value)
          FROM timeseries
          WHERE server-type = 'Unix'
          GROUP BY timestamp
      • [Figure: timeseries → Map → Shuffle / Sort → Reduce]
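One plausible way to read the Map, Shuffle/Sort and Reduce stages for the query on this slide is sketched below in a single process. It does not show Hive's actual execution plan; the function `run_query`, the field name `server_type` (written `server-type` on the slide) and the sample rows are all hypothetical.

```python
from collections import defaultdict

def run_query(rows):
    # Map: apply the WHERE filter and emit (timestamp, value) pairs.
    emitted = [
        (row["timestamp"], row["value"])
        for row in rows
        if row["server_type"] == "Unix"
    ]
    # Shuffle / sort: bring all values for the same GROUP BY key together.
    grouped = defaultdict(list)
    for timestamp, value in sorted(emitted):
        grouped[timestamp].append(value)
    # Reduce: compute AVG(value) for each timestamp.
    return {ts: sum(values) / len(values) for ts, values in grouped.items()}

# Hypothetical 5-minute CPU utilization samples.
timeseries = [
    {"timestamp": 0, "server_type": "Unix", "value": 20},
    {"timestamp": 0, "server_type": "Unix", "value": 40},
    {"timestamp": 0, "server_type": "Windows", "value": 90},
    {"timestamp": 300, "server_type": "Unix", "value": 10},
]

result = run_query(timeseries)
print(result)
```

The Windows row is dropped in the map step, so it never reaches the shuffle, which is the usual reason for pushing filters as early as possible.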
  • 15. Thanks