Hadoop: What It Is and What It's Not


The Briefing Room with Mark Madsen and Hortonworks
Slides from the Live Webcast on Oct. 16, 2012

The power of Hadoop cannot be denied, as evidenced by the fact that all the biggest closed-source vendors in the world of data management have embraced this open-source project with virtually open arms. But Hadoop is not a data warehouse, nor is it ever likely to be. Rather, its ideal role for now is to augment traditional data warehousing and business intelligence. As an adjunct, Hadoop provides an amazing mechanism for storing and analyzing Big Data. The key is to manage expectations and move forward carefully.

Check out this episode of The Briefing Room to hear veteran analyst Mark Madsen of Third Nature explain how, where, when and why to leverage the open-source elephant in the enterprise. He'll be briefed by Jim Walker of Hortonworks, who will tout his company's vision for the future of Big Data management. Walker will provide details on the Hortonworks data platform and how it can be used to complete the picture of information management. He'll also discuss how the Hortonworks partner network can help companies get big value from Big Data.


Published in: Technology


  • 1. Eric.kavanagh@bloorgroup.com Twitter Tag: #briefr The Briefing Room
  • 2. • Reveal the essential characteristics of enterprise software, good and bad • Provide a forum for detailed analysis of today's innovative technologies • Give vendors a chance to explain their product to savvy analysts • Allow audience members to pose serious questions... and get answers!
  • 3. • November: Cloud • December: Innovators • January: Big Data • February: Performance • March: Integration
  • 4. • The Data Warehouse was once considered the Holy Grail of Business Intelligence, but as data volumes increase exponentially, we’re finding that data warehousing cannot be all things for all users. • Hadoop was initially developed at Yahoo! to support a search engine project and has since turned into the poster child for open source Big Data processing. • While Hadoop is not a data warehouse, its capabilities can help organizations store and analyze huge volumes of data.
  • 5. Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor at Forbes Online and Information Management. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 6. • Hortonworks is an enterprise software company that focuses on the development and support of Apache Hadoop. • Its product is the Hortonworks Data Platform, an open source platform for storing, processing and analyzing large volumes of data from many sources and in a variety of formats. • Hortonworks recently introduced its Hive ODBC Driver 1.0, which allows users to integrate its Hadoop platform with the BI apps running on top.
  • 7. Jim is the Director of Product Marketing at Hortonworks. He is a recovering developer, professional marketer and amateur photographer with nearly twenty years' experience building products and developing emerging technologies. During his career, he has brought multiple products to market in a variety of fields, including data loss prevention, master data management and now big data. At Hortonworks, Jim is focused on accelerating the development and adoption of Apache Hadoop.
  • 8. Hadoop: What It Is & Isn’t. October 2012. Jim Walker, Director, Product Marketing, Hortonworks. © Hortonworks Inc. 2012
  • 9. Big Data: Organizational Game Changer. Transactions + Interactions + Observations = BIG DATA. [Chart: data volume (megabytes, gigabytes, terabytes, petabytes) against increasing data variety and complexity, with tiers ERP (megabytes), CRM (gigabytes), WEB (terabytes) and BIG DATA (petabytes). Data types shown: purchase detail, purchase record, payment record, offer details, offer history, segmentation, customer touches, support contacts, web logs, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels, mobile web, sentiment, user click stream, SMS/MMS, speech to text, social interactions & feeds, spatial & GPS coordinates, sensors / RFID / devices, business data feeds, external demographics, user generated content, HD video, audio, images, product/service logs.]
  • 10. What is a Data Driven Business? • DEFINITION: Better use of available data in the decision making process • RULE: Key metrics derived from data should be tied to goals • PROVEN RESULTS: Firms that adopt Data-Driven Decision Making have output and productivity that is 5-6% higher than what would be expected given their investments and usage of information technology* * “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?” Brynjolfsson, Hitt and Kim (April 22, 2011)
  • 11. Big Data: Optimize Outcomes at Scale • Media: optimize content • Intelligence: optimize detection • Finance: optimize algorithms • Advertising: optimize performance • Fraud: optimize prevention • Retail / Wholesale: optimize inventory turns • Manufacturing: optimize supply chains • Healthcare: optimize patient outcomes • Education: optimize learning outcomes • Government: optimize citizen services. Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
  • 12. Enterprise Big Data Flows. [Diagram: sources (unstructured data, log files, exhaust data, social media, sensors and devices, DB data) and business transactions & interactions (CRM, ERP, web, mobile, point of sale) connect through classic data integration & ETL and the Big Data Platform to business intelligence & analytics (dashboards, reports, visualization, …). Platform stages: Capture Big Data, Process, Distribute Results, Feedback.] 1. Collect data from all sources, structured & unstructured 2. Transform, refine, aggregate, analyze, report 3. Interoperate and share data with applications/analytics 4. Use operational data w/in big data platform, preserve data
  • 13. Data Platform Requirements for Big Data. Capture: • Collect data from all sources - structured and unstructured data • All speeds: batch, async, streaming, real-time. Process: • Transform, refine, aggregate, analyze, report. Exchange: • Deliver data with enterprise data systems • Share data with analytic applications and processing. Operate: • Provision, monitor, diagnose, manage at scale • Reliability, availability, affordability, scalability, interoperability. Across all deployment models: operating systems, virtual platforms, cloud platforms, Big Data appliances.
  • 14. Apache Hadoop & Big Data Use Cases. Big Data: Transactions, Interactions, Observations. Refine, Explore, Enrich. Business Case.
  • 15. Operational Data Refinery: Hadoop as platform for ETL modernization (Refine). Capture: • Capture new unstructured data along with log files all alongside existing sources • Retain inputs in raw form for audit and continuity purposes. Process: • Parse the data & cleanse • Apply structure and definition • Join datasets together across disparate data sources. Exchange: • Push to existing data warehouse for downstream consumption • Feeds operational reporting and online systems. [Diagram: unstructured data, log files and DB data flow through Capture and archive, Parse & Cleanse, Structure and join, and Upload into the Enterprise Data Warehouse.]
  • 16. Big Data Exploration & Visualization: Hadoop as agile, ad-hoc data mart (Explore). Capture: • Capture multi-structured data and retain inputs in raw form for iterative analysis. Process: • Parse the data into queryable format • Explore & analyze using Hive, Pig, Mahout and other tools to discover value • Label data and type information for compatibility and later discovery • Pre-compute stats, groupings, patterns in data to accelerate analysis. Exchange: • Use visualization tools to facilitate exploration and find key insights • Optionally move actionable insights into EDW or datamart. [Diagram: unstructured data, log files and DB data flow through Capture and archive, Structure and join, Categorize into tables, and upload via JDBC / ODBC to Explore with visualization tools; optional path to EDW / datamart.]
  • 17. Application Enrichment: Deliver Hadoop analysis to online apps (Enrich). Capture: • Capture data that was once too bulky and unmanageable. Process: • Uncover aggregate characteristics across data • Use Hive, Pig and MapReduce to identify patterns • Filter useful data from mass streams (Pig) • Micro or macro batch oriented schedules. Exchange: • Push results to HBase or other NoSQL alternative for real time delivery • Use patterns to deliver right content/offer to the right person at the right time. [Diagram: unstructured data, log files and DB data flow through Capture, Parse, Derive/Filter and Enrich, scheduled & near real time, into NoSQL / HBase for low-latency delivery to online applications.]
  • 18. Hadoop in Enterprise Data Architectures. [Diagram labels. Existing infrastructure: ODS & Datamarts, EDW, Applications & Spreadsheets, Visualization & Intelligence, IDE & Dev Tools, Operations, Discovery Tools, Low Latency/NoSQL, Custom, Web Applications. New tech: Datameer, Tableau, Karmasphere, Splunk. Hadoop components: Templeton, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper. Big Data sources (transactions, observations, interactions): CRM, ERP, financials, social media, exhaust data, logs, files.]
  • 19. Where Does It Fit into Your Business? [Table: use cases by vertical, in columns Refine, Explore, Enrich.] Retail & Web: • Log Analysis/Site Optimization • Social Network Analysis • Dynamic Pricing • Session & Content Optimization. Retail: • Loyalty Program Optimization • Brand and Sentiment Analysis • Dynamic Pricing/Targeted Offer. Intelligence: • Threat Identification • Person of Interest Discovery • Cross Jurisdiction Queries. Finance: • Risk Modeling & Fraud Identification • Trade Performance Analytics • Surveillance and Fraud Detection • Customer Risk Analysis • Real-time upsell, cross sales marketing offers. Energy: • Smart Grid: Production Optimization • Grid Failure Prevention • Smart Meters • Individual Power Grid. Manufacturing: • Supply Chain Optimization • Customer Churn Analysis • Dynamic Delivery • Replacement parts. Healthcare & Payer: • Electronic Medical Records (EMPI) • Clinical Trials Analysis • Insurance Premium Determination
  • 20. Hortonworks Vision & Leadership. We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop. Trusted: • Stewards of core Hadoop • Original builders and operators of Hadoop • 100+ years Hadoop development experience • Managed every viable, stable Hadoop release • HDP built on Hadoop 1.0. Open: • 100% open platform • No POS holdback • Open to the Hadoop community • Open to the Hadoop ecosystem • Closely aligned to Hadoop core. Innovative: • Innovating current platform with HCatalog, Ambari, HA • Innovating future platform with YARN, HA • Complete vision for Hadoop-based platform • Enable the Hadoop ecosystem
  • 21. Hortonworks Data Platform • Simplify deployment to get started quickly and easily • Monitor, manage any size cluster with familiar console and tools • Only platform to include data integration services to interact with any data • Metadata services open the platform for integration with existing applications • Dependable high availability architecture • Tested at scale to future proof your cluster growth ✓ Reduce risks and cost of adoption ✓ Lower the total cost to administer and provision ✓ Integrate with your existing ecosystem
  • 23. “In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.” Grace Hopper © Third Nature Inc.
  • 24. What’s different today? We’re not getting more CPU speed, but more CPU cycles. There are too many CPUs relative to other resources, creating an imbalance in hardware platforms. We therefore use nodes to aggregate memory, network bandwidth and IOPs. Most software is designed for a single worker, not high degrees of parallelism, and won’t scale well.
  • 25. Data volume is the oldest, easiest problem. Teradata
  • 26. Analytics makes the data volume problem bigger. Many of the processing problems are O(n²) or worse, so moderate data can be a problem for DW architectures
  • 27. “I need that data now.” “It would be logical to keep all the data in one place. It will take 6 months.” A common problem with new projects or unexpected business problems…
  • 28. The proposed solution? Load Hadoop and analyze
  • 29. Welcome to the Hadoop schema! Why soft / no schema can be good: • Easier programming • Easier modeling since you don’t have to be perfect in advance, and it’s change-resilient • Join elimination = I/O savings (if no updates)
  • 30. Whether to switch from a DB isn’t the right discussion. [Image text: SQL? Hadoop SQL! SQL SQL...]
  • 31. Strategy: There’s a pony in there somewhere
  • 32. …but you need a unicorn to find the pony
  • 33. Questions for discussion 1. Is scale of data really that much of a problem for most organizations? 2. Hadoop is designed for batch work – how good is it for interactive use? Real-time use cases? 3. How do you define “platform”? 4. ETL modernization is mentioned, but isn’t this a reversion to manual coding? 5. How do you design for long-term use rather than one-off analysis projects? 6. Does open source really matter for this part of the stack?
  • 34. CC Image Attributions. Thanks to the people who supplied the Creative Commons licensed images used in this presentation: Phone dump - Richard Barnes; ponies in field.jpg - http://
  • 36. • This Month: Database • November: Cloud • December: Innovators • January: Big Data • 2013 Editorial Calendar
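
The refine steps listed on slide 15 (retain raw inputs, parse and cleanse, apply structure, join across sources, push downstream) can be sketched in a few lines of plain Python. This is an illustration only, not Hortonworks' implementation: the log format, field names and CRM rows below are made up, and a real deployment would more likely run equivalent logic in Hive, Pig or MapReduce over HDFS rather than over in-memory lists.

```python
import re
from collections import Counter

# Hypothetical raw inputs (not from the slides): web log lines and CRM rows.
# The raw lines are kept untouched, mirroring "retain inputs in raw form".
RAW_LOG_LINES = [
    "2012-10-16T10:00:01 user=42 page=/pricing",
    "2012-10-16T10:00:05 user=42 page=/signup",
    "corrupted line without fields",
    "2012-10-16T10:01:12 user=7 page=/pricing",
]
CRM_ROWS = [
    {"user_id": 42, "segment": "enterprise"},
    {"user_id": 7, "segment": "smb"},
]

# Assumed log layout: timestamp, user id, page path.
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) user=(?P<user_id>\d+) page=(?P<page>\S+)$")


def parse_and_cleanse(lines):
    """Apply structure at read time and skip lines that do not match."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # the raw line stays archived; it is only left out of the refined set
        records.append(
            {"ts": match["ts"], "user_id": int(match["user_id"]), "page": match["page"]}
        )
    return records


def join_with_crm(records, crm_rows):
    """Join log records to CRM rows on user_id, defaulting to 'unknown'."""
    segments = {row["user_id"]: row["segment"] for row in crm_rows}
    return [dict(rec, segment=segments.get(rec["user_id"], "unknown")) for rec in records]


def page_views_by_segment(joined):
    """Aggregate to (segment, page, views) rows ready to load downstream."""
    counts = Counter((rec["segment"], rec["page"]) for rec in joined)
    return sorted((seg, page, n) for (seg, page), n in counts.items())


refined = join_with_crm(parse_and_cleanse(RAW_LOG_LINES), CRM_ROWS)
warehouse_rows = page_views_by_segment(refined)
```

Because the structure is applied when the data is read rather than when it is stored, changing `LOG_PATTERN` and re-running over the retained raw lines is enough to refine the data differently, which is the "soft schema" point made on slide 29.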