Apache Hadoop and its role in Big Data architecture - Himanshu Bari


In today’s world of exponentially growing big data, enterprises are increasingly aware of the business value, and the necessity, of harnessing, storing and analyzing this information. Apache Hadoop has rapidly evolved into a leading platform for managing and processing big data, with the management, monitoring, metadata and integration services that organizations need to extract maximum business value and intelligence from their growing volumes of information on customers, web trends, products and competitive markets. In this session, Hortonworks' Himanshu Bari will discuss the opportunities for deriving business value from big data by looking at how organizations use Hadoop to store, transform and refine large volumes of this multi-structured information. Bari will also discuss the evolution of Apache Hadoop and where it is headed, the component requirements of a Hadoop-powered platform, and solution architectures that integrate Hadoop with existing data discovery and data warehouse platforms. In addition, he will look at real-world use cases where Hadoop has helped to produce more business value, increase productivity or identify new and potentially lucrative opportunities.

Published in: Technology

  1. © Hortonworks Inc. 2013
     The Rise of Apache Hadoop… and its Role in Enterprise Data Architectures
     Himanshu Bari, Sr. Product Manager, Hortonworks
  2. Topics
     • Market trends & emergence of Hadoop
     • Hadoop’s role and future direction in the Enterprise
     • Enterprise Hadoop use cases
  3. Big Data & Big Impact
     • Big Data: 15x growth rate of machine-generated data by 2020 (Source: IDC)
     • Big Impact: 20% is the percentage by which companies leveraging data will outperform their peers; 1.5M data-savvy managers needed (Source: McKinsey)
  4. 2013: CIOs take note…
     2013 State of the CIO Survey (Jan 2013). When it comes to adoption of Big Data:
     • 34% of IT executives surveyed classify their organization as late majority
     • 25% of IT executives surveyed classify their organization as laggards
  5. What is Apache Hadoop?
     • Petabyte-scale reliable storage management (HDFS) on commodity disks
     • Highly distributed computation framework (MapReduce)
     • Apache Hadoop = Open Source Data Management Software
  6. Quick History: Hadoop at Yahoo!
     Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
  7. Topics
     • Market Trends & Emergence of Hadoop
     • Hadoop’s role and future direction in the Enterprise
     • Enterprise Hadoop Use Cases
  8. Current Data Architecture
     • Applications: Business Analytics, Custom Applications, Packaged Applications
     • Data systems: traditional repos (RDBMS, EDW, MPP)
     • Data sources: OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP)
     • Operational tools: manage & monitor
     • Dev & data tools: build & test
     • ETL/ELT
  9. Current Data Architecture Pressured
     • Applications: Business Analytics, Custom Applications, Packaged Applications
     • Data systems: traditional repos (RDBMS, EDW, MPP)
     • Data sources: OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP)
     • New sources, 85% data growth (sentiment, clickstream, geo, sensor, …)
     • Operational tools: manage & monitor
     • Dev & data tools: build & test
     • ETL/ELT
  10. Next generation data architecture
      • Applications: Business Analytics, Custom Applications, Packaged Applications
      • Data systems: traditional repos (RDBMS, EDW, MPP) alongside the Enterprise Hadoop Platform
      • Data sources: OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP)
      • New sources, 85% data growth (sentiment, clickstream, geo, sensor, …)
      • Operational tools: manage & monitor
      • Dev & data tools: build & test
      • ETL/ELT
  11. New architecture enables schema on read
      • Old way
        – Define table with schema
        – Load only table-conforming data
        – Change? Fight for eternity
      • Hadoop way
        – Load COMPLETE data in Hadoop
        – Read data as you like
        – Change? Just read differently
  12. Evolution of Enterprise Hadoop
      Enterprise Hadoop Platform:
      • Hadoop core: HDFS, MapReduce, YARN, Tez, other
      • Platform services: enterprise readiness (High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots), Knox
      • Data services: Stinger, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS
      • Operational services: Oozie, Ambari, Falcon
      • Deployment options: OS/VM, Cloud, Appliance, OpenStack
  13. YARN: General purpose resource management framework (Hadoop core)
      • Why is it needed?
        – New ways of processing data, such as graph and stream processing, have different resource management needs than MapReduce
        – Need to improve scalability & utilization of the clusters
        – Support multiple versions of MapReduce
      • How does it work?
        – Splits JobTracker responsibilities into a global resource manager and a per-application ApplicationMaster
        – Provides an extensible framework
      • Stack: HDFS (redundant, reliable storage); YARN (cluster resource management); MapReduce, Tez, stream processing and other engines on top
  14. Apache Tez (“Speed”): Alternative to MapReduce (Hadoop core)
      • Why is it needed?
        – Widens the platform for Hadoop use cases beyond batch
        – Crucial to improving the performance of low-latency applications
      • Core idea
        – Create a pool of pre-allocated containers
        – Reuse containers for multiple tasks
      • Task: pluggable input, pluggable processor, pluggable output
  15. Stinger: Improve Hive performance and SQL compliance (data services)
      • Stinger Project, simple focus:
        – Improves existing tools & preserves investments
        – Enable Hive to support interactive workloads
      • Hive query planner + execution engine (Tez) + new file format (ORC file) = 100X
      • Data types + windowing & subqueries = SQL compliance
  16. Falcon: One-stop Shop for Data Lifecycle Management (DLM) (operational services)
      • Data management needs that Apache Falcon provides: data processing, replication, retention, scheduling, reprocessing, multi-cluster management
      • Tools that Apache Falcon orchestrates: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs
      • Falcon provides a single interface to orchestrate data lifecycle.
      • Sophisticated DLM easily added to Hadoop applications.
  17. Knox: Make Hadoop Security Simple (platform services)
      • Simplify security: simplify security for both users and operators.
      • Aggregate access: deliver unified and centralized access to the Hadoop cluster to give a ‘single application’ feel
      • Client agility: ensure service users are abstracted from where services are located and how services are configured & scaled
      • Diagram: a client reaches the Hadoop cluster over REST through the Knox gateway cluster, which handles authentication & verification against a user store (KDC, AD, LDAP)
  18. Project Savanna to enable Hadoop on OpenStack (cloud platform enablement)
      • OpenStack provides operational agility and deployment choice
      • Hadoop is a net new workload and a perfect app for OpenStack
      • Integration marries two of the largest open source movements
        – Community-driven innovation outpaces any single vendor
        – Both are attracting major ecosystem players: IBM, RHT, HP, RAX, etc…
      • Project Savanna: automate deployment of Apache Hadoop on OpenStack
  19. Topics
      • Market Trends & Emergence of Hadoop
      • Hadoop’s role and future direction in the Enterprise
      • Enterprise Hadoop Use Cases
  20. Fundamental business drivers the same…
      • Better
        – Automation
        – Transparency
        – Segmentation
        – Innovation & experimentation
      • Faster
        – Everything
      • Cheaper
        – Across the value chain
  21. 6 Common Types of Data
      1. Sentiment: understand how your customers feel about your brand and products – right now
      2. Clickstream: capture and analyze website visitors’ data trails and optimize your website
      3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines
      4. Geographic: analyze location-based data to manage operations where they occur
      5. Server Logs: research logs to diagnose process failures and prevent security breaches
      6. Text: understand patterns in text across millions of web pages, emails, and documents
  22. Financial services
      • Industry-specific drivers for Hadoop
        – Increasing compliance and regulatory pressure
        – Bad guys never stop
        – Never-ending macroeconomic volatility
        – Cost pressures, more than ever
        – Extreme competition
      • Common use cases
        – Fraud & risk reduction (e.g., during new account creation)
        – Sentiment-based trading strategies
        – Improve insurance underwriting based on usage and longer history
        – Data reservoir (for archival, compliance inquiries, etc.)
  23. Retail
      • Industry-specific drivers for Hadoop
        – Increasingly ‘value sensitive’ and SMART consumer
        – Constant margin pressure
        – Emergence of the multi-channel approach
      • Common use cases
        – 360-degree customer view (behavior, location, sentiment, etc.)
        – Micro and dynamic segmentation
        – Optimizations: price, assortment, layout, supply chain
        – Seasonal predictions: product styles, labor needs, etc.
  24. Telcos
      • Industry-specific drivers for Hadoop
        – Infrastructure under stress with the rise of smart devices
        – 4G/LTE investment needs CapEx but revenue growth is largely flat
        – Cloud computing changing the game
        – Increasing competition with non-telcos
        – Sitting on a gold mine of data
      • Common use cases
        – Understanding customer behavior AND context (e.g., location-based) in real time
        – Packaging and selling data
        – Call Detail Record (CDR) & extended data record (XDR) analysis for service quality improvement & capacity planning/optimization
        – Customer churn analysis & prevention
  25. Healthcare & Pharma
      • Industry-specific drivers for Hadoop
        – Sudden data deluge
        – Health Data Initiative (HDI) by the US Govt.
        – Digitization of health records and rise of sensor data
        – Huge accumulation of R&D data
        – Rising healthcare costs
        – Payors shift to outcome-based payment models with providers as well as pharmaceutical companies
      • Common use cases
        – Improved patient outcome tracking
        – Optimized patient recruitment for drug trials
        – Reduce drug modeling time
        – Improve insurance claim validation accuracy
  26. Manufacturing
      • Industry-specific drivers for Hadoop
        – Proliferation of sensors
        – Globalization of the supply chain
        – Ongoing miniaturization of products
      • Common use cases
        – Failure analysis to perform proactive maintenance
        – Improving equipment quality by more frequent sample testing and rigorous prototype testing
        – Supply chain optimization
  27. Hadoop Summit (hadoopsummit.org)
      • June 26-27, 2013, San Jose Convention Center
      • Co-hosted by Hortonworks & Yahoo!
      • Theme: Enabling the Next Generation Enterprise Data Platform
      • 90+ sessions and 7 tracks
      • Community-focused event
        – Sessions selected by a conference committee
        – Community Choice allowed the public to vote for sessions they want to see
      • Training classes offered pre-event
        – Apache Hadoop Essentials: A Technical Understanding for Business Users
        – Understanding Microsoft HDInsight and Apache Hadoop
        – Developing Solutions with Apache Hadoop – HDFS and MapReduce
        – Applying Data Science using Apache Hadoop
  28. Thank You
      Follow us: @hortonworks
      http://hortonworks.com/products/hortonworks-sandbox/
  29. © Hortonworks Inc. 2012
      Similar solution architecture across use cases
      • Flow (steps 1 to 5): source data → load (Sqoop, Flume, WebHDFS, NFS) → Hadoop → use (DB, EDW, MPP)
      • Processing modes: batch (MapReduce, Pig), streaming (Storm), interactive (Hive/SQL), online (HBase)
      • Components: Ambari, HCatalog (table metadata), Pig (data processing), Hive (data processing), YARN
      • Cluster: multiple compute & storage nodes
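
Slide 5 describes MapReduce as a highly distributed computation framework. As a rough illustration of the programming model only, the sketch below runs the classic word count in a single Python process. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example, not part of the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for the step between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; on a cluster each line could live on a different node.
    print(word_count(["big data big impact", "data at rest"]))
```

On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data, with the framework performing the grouping step; here everything runs in one loop so the three phases are easy to follow.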
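
Slide 11 contrasts defining a schema before loading with loading complete data and deciding how to read it later. A minimal sketch of that schema-on-read idea, in plain Python rather than Hive or HCatalog, might look like the following. The sample events, the field names, and the helper `read_with_schema` are all hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Raw events are kept exactly as they arrived; nothing is rejected at load time.
RAW_EVENTS: List[str] = [
    '{"user": "a1", "page": "/home", "ms": 120, "geo": "US"}',
    '{"user": "b2", "page": "/cart", "ms": 340}',
    '{"user": "a1", "page": "/cart", "ms": 95, "geo": "DE"}',
]

# A schema here is just a mapping from field name to a cast function.
Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(raw_lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Project each raw record onto the schema chosen at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two different "reads" of the same stored data; changing the question
# means changing the schema, not reloading the data.
LATENCY_SCHEMA: Schema = {"page": str, "ms": int}
GEO_SCHEMA: Schema = {"user": str, "geo": str}

if __name__ == "__main__":
    for row in read_with_schema(RAW_EVENTS, LATENCY_SCHEMA):
        print(row)
    for row in read_with_schema(RAW_EVENTS, GEO_SCHEMA):
        print(row)
```

The second event has no `geo` field. A schema-on-write table might have rejected it at load time, whereas here it is stored and simply reads back with `geo` set to `None`.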
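
Slide 14 gives the core idea behind Tez as creating a pool of pre-allocated containers and reusing them for multiple tasks. The toy model below illustrates only that reuse pattern. It is not the Tez or YARN API; the `Container` and `ContainerPool` classes are invented for this sketch, and tasks run sequentially to keep it short.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple


class Container:
    """Stand-in for an allocated execution slot (hypothetical, not a YARN container)."""

    def __init__(self, container_id: int) -> None:
        self.container_id = container_id
        self.tasks_run = 0

    def run(self, task: Callable[[], object]) -> object:
        self.tasks_run += 1
        return task()


class ContainerPool:
    """Allocate containers once, then hand them out again for each new task."""

    def __init__(self, size: int) -> None:
        if size < 1:
            raise ValueError("pool needs at least one container")
        # The allocation cost is paid here, up front, not per task.
        self._idle: Deque[Container] = deque(Container(i) for i in range(size))

    def run_all(self, tasks: Iterable[Callable[[], object]]) -> List[Tuple[int, object]]:
        """Run every task on the next idle container and record which one ran it."""
        results: List[Tuple[int, object]] = []
        for task in tasks:
            container = self._idle.popleft()
            try:
                results.append((container.container_id, container.run(task)))
            finally:
                # Return the container to the pool instead of tearing it down.
                self._idle.append(container)
        return results

    def tasks_per_container(self) -> Dict[int, int]:
        """Report how many tasks each pooled container has executed."""
        return {c.container_id: c.tasks_run for c in self._idle}


if __name__ == "__main__":
    pool = ContainerPool(size=2)
    squares = [lambda n=n: n * n for n in range(5)]
    print(pool.run_all(squares))
    print(pool.tasks_per_container())
```

Five tasks are served by two containers, so no new container is created after startup. Avoiding that per-task startup is what the slide ties to better performance for low-latency applications.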