Hadoop For Enterprises


Big Data is a megatrend in IT, and Hadoop is the front-runner. This is a short overview of Hadoop, its ecosystem, and reference architectures.


  • http://www.zdnet.com/blog/big-data/hadoop-20-mapreduce-in-its-place-hdfs-all-grown-up/267
    http://www.informationweek.com/news/software/info_management/231900633
    http://www.informationweek.com/news/software/info_management/232400021
    http://www.informationweek.com/news/software/info_management/232300181
    http://www-01.ibm.com/software/data/infosphere/biginsights/basic.html
    http://www.informationweek.com/news/galleries/software/enterprise_apps/232500290?pgno=1
  • http://www.informationweek.com/news/software/bi/229900002?cid=RSSfeed_IWK_News
  • Hadoop implements the core features that are at the heart of most modern EDWs: cloud-facing architectures, MPP, in-database analytics, mixed workload management, and a hybrid storage layer
  • http://wiki.apache.org/hadoop/PoweredBy
  • HBase is a full-fledged database (albeit not relational) that uses HDFS as storage. This means you can run interactive queries and updates on your dataset. Sqoop takes data from any database that supports JDBC and moves it into HDFS.
    If you haven’t already, check out Toad® for Cloud Databases, our free, fully functional, commercial-grade cloud solution. With Toad for Cloud Databases, you can easily generate queries, migrate, browse, and edit data, as well as create reports and tables, all in a familiar SQL view. Finally, everyone can experience the productivity gains and cost benefits of NoSQL and big data without the headaches.
    Toad for Cloud Databases provides unrivaled support for Apache Hive, Apache HBase, Apache Cassandra, MongoDB, Amazon SimpleDB, Microsoft Azure Table Services, Microsoft SQL Azure, and all open database connectivity (ODBC)-enabled relational databases (including Oracle, SQL Server, MySQL, DB2, and others).
  • Netflix uses a similar reference architecture for movie recommendations. Hadoop is not suited for low-latency workloads. Facebook does use HBase for messaging, which is close to real-time functionality.
  • http://facility9.com/nosql/glossary/
  • Weka read this… it is similar… Mahoot is AI…

Hadoop For Enterprises: Presentation Transcript

  • Hadoop for Enterprise, rev 7. Rajesh Nadipalli, Mar 2012, rajesh.nadipalli@gmail.com
  • Hadoop getting attention
    ◦ Feb 2012: Microsoft and Hortonworks partner to develop an Excel plug-in for Hadoop
    ◦ Jan 2012: Oracle announces its Big Data Appliance with Cloudera’s Hadoop distribution
    ◦ Dec 2011: EMC releases its Unified Analytics Platform, which includes the Greenplum Apache Hadoop distribution
    ◦ Oct 2011: Microsoft plans to add Hadoop support to SQL Server 2012
    ◦ May 2010: IBM introduces Hadoop-based InfoSphere BigInsights
  • In this Presentation…
    ◦ Big Data – Big Opportunities
    ◦ Hadoop for Enterprise – Reference Arch
    ◦ Map Reduce Overview
    ◦ Hive
    ◦ References
  • BIG DATA – BIG OPPORTUNITIES
  • Big Data - Business Opportunity
    Enterprises today are challenged with:
    ◦ Exponential data growth
    ◦ Complex data needs, both structured and unstructured
    ◦ Real-time insights with key indicators
    ◦ Heterogeneous environments: private and public clouds
    ◦ Tighter budgets and the need to do more with less
    Traditional relational databases are not able to scale to meet these challenges.
  • Big Data – 4 V’s (Forrester): http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars
  • Why Hadoop? Hadoop provides…
    ◦ Distributed file system
    ◦ Parallel computing across several nodes
    ◦ Support for structured and unstructured content
    ◦ Fault tolerance and linear scalability
    ◦ Open source under the Apache foundation
    ◦ Increasing support from vendors
    ◦ Key philosophy: “moving compute is cheaper than moving data”
    Forrester regards Hadoop as the nucleus of the next-generation EDW in the cloud.
  • Some users of Hadoop… (http://wiki.apache.org/hadoop/PoweredBy)
    ◦ Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
    ◦ Currently we have 2 major clusters: a 1100-machine cluster with 8800 cores and about 12 PB raw storage, and a 300-machine cluster with 2400 cores and about 3 PB raw storage.
    ◦ Each (commodity) node has 8 cores and 12 TB of storage.
    ◦ Hadoop used to analyze the log of search and do some mining work on web page database
    ◦ We handle about 3000 TB per week. Our clusters vary from 10 to 500 nodes.
    ◦ 532-node cluster (8 * 532 cores, 5.3 PB).
    ◦ Heavy usage of Java MapReduce, Pig, Hive, HBase
    ◦ Using it for search optimization and research
    ◦ 5-machine cluster (8 cores/machine, 5 TB/machine storage)
    ◦ Existing 19-virtual-machine cluster (2 cores/machine, 30 TB storage)
    ◦ Predominantly Hive and Streaming API based jobs (~20,000 jobs a week)
    ◦ Daily batch ETL; log analysis; data mining; machine learning
  • HADOOP REFERENCE ARCHITECTURE
  • Hadoop for Enterprise – Technology Stack
    [Diagram: technology stack, listed from the top layer to the bottom layer]
    ◦ User Experience: ad-hoc queries, notifications/alerts, embedded search, analytics
    ◦ Data Access: Excel, R (Rhipe, RBits), Hive, Pig, Datameer
    ◦ Data Processing (Hadoop): MapReduce
    ◦ Data Store: HBase (NoSQL DB), HDFS
    ◦ Sqoop, between the data sources and the data store
    ◦ Data Sources: applications (internal), databases, log files, RSS feeds, cloud, others
    ◦ Alongside the layers: Zookeeper (orchestration, quorum), Pentaho (scheduling, integrations)
  • Hadoop for BI – Reference Architecture
    [Diagram: data flows from the sources, through the Hadoop cluster, to the enterprise apps]
    ◦ Data: RDBMS, XML, JSON, binary, CSV, log, Java objects
    ◦ Hadoop Distributed Computing Environment: N-node scalable cluster running MapReduce on the Hadoop File System (HDFS)
    ◦ Enterprise Apps: dashboards, Excel, ERP and enterprise apps, import into RDBMS
  • Oracle’s Big Data Solution (http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/wp-big-data-with-oracle-521209.pdf?ssSourceSiteId=ocomen)
    ◦ Oracle sees Hadoop as a good fit for sourcing unstructured data and for MapReduce.
    ◦ It recommends using Oracle Database for the final analysis stage.
    ◦ Oracle Data Integrator can issue Hive queries (ETL).
    ◦ Oraoop, a wrapper on top of Sqoop for Oracle databases, is also available (see references).
  • DATA PROCESSING
  • Hadoop MapReduce Overview
    [Diagram: Map Reduce process. Data from HDFS is split across Node 1, Node 2, and Node 3; each node runs a Map task on its split; the Reduce step combines the mapped output into the results.]
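
The split, map, and reduce flow in the diagram can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop's API: the function names, the round-robin split, and the word-count job are all assumptions chosen for the example.

```python
from collections import defaultdict


def split(records, n_nodes):
    """Deal records round-robin into n_nodes chunks (a stand-in for HDFS splits)."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this node's chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across every node's map output."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records, n_nodes=3):
    """Run the whole split -> map -> shuffle -> reduce pipeline in one process."""
    chunks = split(records, n_nodes)
    mapped = [map_phase(chunk) for chunk in chunks]  # one map task per "node"
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    lines = ["hadoop stores data", "hadoop moves compute", "data stays put"]
    print(word_count(lines))
```

In a real cluster the map tasks run in parallel on the nodes that already hold the data, which is the "moving compute is cheaper than moving data" philosophy mentioned earlier.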
  • Map Reduce Tips
    ◦ The first step is to understand what data you have and how to feed it into the Hadoop distributed computing environment.
    ◦ Use distributed applications to run analytics over the massive data sets and surface opportunities.
    ◦ Hadoop stores your information for future queries, which supports both exploration and historical reference.
  • DATA STORE
  • HDFS Distributed file system consisting of ◦ One single node is called “Namenode” and has metadata ◦ Several “Datanodes” Designed to run on commodity hardware Data gets imported as blocks (64 MB) These Blocks are replicated (typically 3 copies) to protect for hardware failures Access via Java API’s or hadoop command line ($hadoop fs…) Rajesh.nadipalli@gmail.com
  • HDFS architecture (http://hadoop.apache.org/common/docs/current/hdfs_design.html)
    The next Hadoop revision has a failover Namenode called “Avatar”.
  • HBase Distributed, column-oriented database (NoSQL) Failure-tolerant Low latency HDFS aware Access via Java APIs or REST APIs It is not a replacement for RDBMS Recommended to use Hbase when ◦ Data is searched by key (or range) ◦ Data does not conform to a schema (for instance if you have attributes that change by record). Rajesh.nadipalli@gmail.com
  • HBase Architecture
    [Diagram: Zookeeper; HBase Master with Avatar (failover of the master); four Region Servers]
    ◦ Zookeeper maintains quorum and knows which server is the master
    ◦ The Master keeps track of regions and region servers
    ◦ Region servers store table regions
  • HBase Column Storage
    HBase stores data like tags for a key; for example:
    Row        | Column Family | Column       | Cell
    Star Wars  | Cast          | Cast:Actor1  | Harrison Ford
               | Cast          | Cast:Actor2  | Carrie Fisher
               | Reviews       | Review:IMDB  | Review URL
               | Reviews       | Review:ET    | Review URL2
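
The row, column family, column, and cell layout above can be modelled as nested dictionaries. The class below is a toy in-memory sketch and not the HBase client API: `TinyTable` and its `put`, `get`, and `scan` methods are hypothetical names, used only to show lookups by key and by key range.

```python
from bisect import bisect_left


class TinyTable:
    """Toy model of the layout: row key -> column family -> qualifier -> cell."""

    def __init__(self):
        self.rows = {}

    def put(self, row, family, qualifier, value):
        """Store one cell; rows and families are created on first use."""
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family=None):
        """Look up a whole row, or one column family of it, by row key."""
        data = self.rows.get(row, {})
        return data if family is None else data.get(family, {})

    def scan(self, start, stop):
        """Yield (row, data) for start <= row < stop, in sorted key order."""
        keys = sorted(self.rows)
        for key in keys[bisect_left(keys, start):]:
            if key >= stop:
                break
            yield key, self.rows[key]


if __name__ == "__main__":
    table = TinyTable()
    table.put("Star Wars", "Cast", "Actor1", "Harrison Ford")
    table.put("Star Wars", "Cast", "Actor2", "Carrie Fisher")
    table.put("Star Wars", "Reviews", "IMDB", "Review URL")
    table.put("Star Wars", "Reviews", "ET", "Review URL2")
    print(table.get("Star Wars", "Cast"))
    print([row for row, _ in table.scan("S", "T")])
```

Because each row carries its own set of columns, two rows in the same table do not need to share attributes, which matches the "data does not conform to a schema" use case.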
  • DATA ACCESS
  • Hive Overview
    ◦ Data warehouse software built on top of Hadoop
    ◦ HiveQL provides a SQL-like interface and runs each query as a MapReduce job
    ◦ Provides structure to HDFS data, similar to an Oracle external table
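
"Providing structure to HDFS data" means the raw files stay as they are and a schema is applied when they are read. The sketch below illustrates that idea in plain Python rather than HiveQL; the tab-delimited log format, the `SCHEMA` columns, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical schema for a tab-delimited log file: (column name, type parser)
SCHEMA = [("user", str), ("page", str), ("bytes", int)]


def read_with_schema(lines, schema=SCHEMA, sep="\t"):
    """Apply column names and types to raw text lines at read time."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        yield {name: cast(raw) for (name, cast), raw in zip(schema, fields)}


def bytes_per_page(rows):
    """Rough equivalent of: SELECT page, SUM(bytes) FROM logs GROUP BY page."""
    totals = Counter()
    for row in rows:
        totals[row["page"]] += row["bytes"]
    return dict(totals)


if __name__ == "__main__":
    raw = ["ann\t/home\t100", "bob\t/home\t50", "ann\t/about\t7"]
    print(bytes_per_page(read_with_schema(raw)))
```

In Hive the grouping step would be compiled into a MapReduce job, with the page as the map output key and the sum performed in the reducer.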
  • Hive Architecture
    [Diagram: Hive CLI (browse, query, HiveQL); Hive Parser; Metastore; Execution (Map Reduce); SerDe; HDFS]
  • Pig Overview
    ◦ Pig is a layer on top of MapReduce for statisticians (programmers)
    ◦ It provides several standard operators: join, order by, etc.
    ◦ It allows user-defined functions (UDFs) to be included; Java and Python are supported for UDFs
  • Datameer Overview (http://www.datameer.com/)
    Key philosophy: business users understand Excel; let them do the grouping, sorting, filtering, and aggregates.
    Key steps:
    ◦ Datameer’s source is a MapReduce output.
    ◦ Datameer takes a quick sample of 5000 records.
    ◦ The end user is then presented with an Excel-like interface on top of these 5000 records. This is where end users define their filters, formulas, grouping, aggregations, and joins across sheets (even joining Hadoop data with data from a relational database table).
    ◦ Once the end user has defined the desired end result, they can submit a job to run on the complete data set. Datameer then builds the necessary MapReduce jobs and runs them on the complete data set.
    ◦ Finally, the user gets the results and can build charts, tables, etc., all in the browser.
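
The "design on a sample, then run on everything" workflow can be sketched as follows. This is not Datameer's implementation: taking the first 5000 records as the sample, the function names, and the example aggregation are assumptions for illustration only.

```python
from itertools import islice

SAMPLE_SIZE = 5000  # sample size quoted on the slide


def preview(records, pipeline, sample_size=SAMPLE_SIZE):
    """Run the user's pipeline on a small sample for fast feedback.

    Assumption: the sample is simply the first sample_size records.
    """
    return pipeline(islice(records, sample_size))


def run_full(records, pipeline):
    """Run the same, unchanged pipeline over the complete data set."""
    return pipeline(records)


def total_by_region(rows):
    """Example user-defined step: group by region and sum the amounts."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


if __name__ == "__main__":
    data = [("east", 1)] * 6000 + [("west", 2)] * 10
    print(preview(data, total_by_region))   # computed from the sample only
    print(run_full(data, total_by_region))  # computed from every record
```

The point of the design is that the pipeline definition is identical in both calls; only the amount of data it sees changes.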
  • Excel Integration (http://www.informationweek.com/news/software/info_management/232601675?cid=RSSfeed_IWK_News)
    Microsoft announced Excel integration with Hadoop (Feb 2012) with Hortonworks.
    Key highlights:
    ◦ Microsoft and Hortonworks will deliver a Hive ODBC driver that will enable integration with Excel
    ◦ Microsoft’s PowerPivot in-memory plug-in for Excel will handle larger data sets
    ◦ There is also a plan for a JavaScript framework for Hadoop enabling Ajax like iterative
  • INTEGRATION, SCHEDULING
  • Pentaho Data Integration Pentaho is considered as “strong performer” by Forrestor (Feb 2012) It makes building MapReduce easy via it’s Data Integration IDE. It can read/write to HDFS, run map reduce and Pig scripts The IDE has several standard connectors, transformation, and allows custom java code http://www.pentaho.com/big-data/ Rajesh.nadipalli@gmail.com
  • Pentaho Data Integration (http://www.youtube.com/watch?v=KZe1UugxXcs&feature=player_embedded)
    [Screenshot with three numbered steps]
    1. Build Mapper
    2. Build Reducer
    3. Run Map Reduce
  • Talend - ETL (http://www.talend.com/products-big-data/)
    ◦ Talend is another ETL development, scheduling, and monitoring tool
    ◦ It supports HDFS, Pig, Hive, and Sqoop
  • Talend ETL – with Hadoop
    ◦ Can invoke Hadoop calls (generates Hive queries)
    ◦ See right slide “Processing”
  • USER EXPERIENCE
  • User Experience
    This layer of the stack is generally custom development. However, some tools that work with Hadoop are:
    ◦ Tableau for data analysis and visualizations
    ◦ SAS Enterprise Miner
    ◦ IBM BigInsights
  • REFERENCES
  • References http://hadoop.apache.org/ http://www.cloudera.com/ http://www-01.ibm.com/software/data/bigdata/ http://www.cs.duke.edu/starfish/index.html http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single- node-cluster/ http://karmasphere.com/Download/karmasphere-studio-community-virtual- appliance-for-ibm.html http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop- presentation http://www.slideshare.net/trihug/trihug-november-pig-talk-by-alan- gates?from=ss_embed http://www.trihug.org/ http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper _c11-690561.html http://www.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With- OraHive.pdf Rajesh.nadipalli@gmail.com
  • http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html
  • Key Hadoop Players (http://wiki.apache.org/hadoop/PoweredBy)
  • MAP-R No single point of failure of name node Performance improvements (5 times faster than HDFS) Snapshots, Multi-site copies They have separate Mapreduce (extended mapreduce) MapR is 8K blocks instead of 64MB block size of HDFS Rajesh.nadipalli@gmail.com
  • Open Topics – why there is an adoption issue
    ◦ Security: no concept of roles
    ◦ Backup, recovery
    ◦ ACID not supported
  • Thank You to my viewers
  • Questions / Comments: Rajesh.nadipalli@gmail.com