201305 hadoop jpl-v3

Speaker notes
  • I want to thank Chris for inviting me here today. Chris and team have done a number of projects with Hadoop. They are a great resource for Big Data projects. Chris is an Apache Board member and was a contributor to Hadoop even before we spun it out of the Nutch project.
  • http://grist.files.wordpress.com/2006/11/csxt_southbound_freight_train.jpg
    http://businessguide.rw/main/gallery/Fedex-Fleet.jpg
  • Notes… credit http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Hadoop-Cluster.PNG
  • As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to the traditional approaches. It is not replacing the database but rather is a complement, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
    – Existing applications, such as Tableau, SAS, Business Objects, etc.
    – Existing databases and data warehouses, for loading data to / from the data warehouse
    – Development tools used for building custom applications
    – Operational tools for managing and monitoring
  • Hadoop started to enhance Search. Science clusters launched in 2006 as an early proof of concept. Science results drive new applications -> becomes core Hadoop business.
  • At Hortonworks today, our focus is very clear: we develop, distribute and support a 100% open source distribution of Enterprise Apache Hadoop. We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community. We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform. Given our operational expertise of running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you. Our approach is also uniquely endorsed by some of the biggest vendors in the IT market. Yahoo is both an investor and a customer, and most importantly, a development partner. We partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo's infrastructure and using the same regression suite that they have used for years as they grew to have the largest production cluster in the world. Microsoft has partnered with Hortonworks to include HDP in both their off-premise offering on Azure and their on-premise offering under the product name HDInsight. This also includes integration with Visual Studio for application development and with System Center for operational management of the infrastructure. Teradata includes HDP in their products in order to provide the broadest possible range of options for their customers.
  • Tell inception story: plan to differentiate Yahoo, recruit talent, ensure that Y! was not built on a legacy private system. From YST.
  • Archival use case at a big bank: 10K files a day == 400 GB. Need to store everything in EBCDIC format for compliance, and also convert it into Hadoop for analytics. Compute a checksum for every record and keep a tally of which primary keys changed each day. Also, bring together financial, customer, and weblog data for new insights. Share with Palantir, Aster Data, Vertica, Teradata, and more…
    Step One: Create tables or partitions. In step one of the dataflow, the mainframe or another orchestration and control program notifies HCatalog of its intention to create a table, or to add a partition if the table exists. This would use standard SQL data definition language (DDL) such as CREATE TABLE and DESCRIBE TABLE (see http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html#HCatalog+DDL). Multiple tables need to be created, though. Some are job-specific temporary tables while others need to be more permanent. Raw-format data can be stored in an HCatalog table partitioned by some date field (month or year, for example). The staged record data will most certainly be stored in HCatalog partitioned by month (see http://incubator.apache.org/hcatalog/docs/r0.4.0/dynpartition.html). Any missing month in the table can then be easily detected and generated from the raw-format storage on the fly. In essence, HCatalog allows creation of tables, which up-levels this architectural challenge from managing a bunch of manually created files and a loose naming convention to a strong yet abstract table structure, much like a mature database solution would have.
    Step Two: Parallel ingest. Before or after tables are defined in the system, we can start adding data in parallel using WebHDFS or DistCp. In the Teradata / Hortonworks Data Platform these architectural components work seamlessly with the standard HDFS NameNode to notify DFS clients of all the DataNodes to which to write data. For example, a file made up of 10,000 64-megabyte blocks could be transferred to a 100-node HDFS cluster using all 100 nodes at once. By asking WebHDFS for the write locations for each block, a multi-threaded or chunking client application could write each 64 MB block in parallel, 100 blocks or more at a time, effectively dividing the 10,000-block transfer into 100 waves of copying. 100 copy waves complete 100 times faster than 10,000 one-by-one block copies. Parallel ingest with HCatalog, WebHDFS and/or DistCp will lead to massive speed gains. Critically, the system can copy chunked data directly into partitions of pre-defined tables in HCatalog. This means that each month, staged record data can join the staging tables without dropping previous months, and staged data can be partitioned by month while each month itself is loaded using as many parallel ingest servers as the solution architecture desires, balancing cost with performance.
    Step Three: Notify on upload. Next, the parallel ingest system needs to notify the HCatalog engine that the files have been uploaded and, simultaneously, any end-user transformation or analytics workload waiting for the partition needs to be notified that the file is ready to support queries. By "ready" we mean that the partition is whole and completely copied into HDFS. HCatalog has built-in blocking and non-blocking notification APIs that use standard message buses to notify any interested parties that a workload, be it MapReduce or HDFS copy work, is complete and valid (see: http://incubator.apache.org/hcatalog/docs/r0.4.0/notification.html). The way this system works is that any job created through HCatalog is acknowledged with an output location. The messaging system later replies that a job is complete and, since the eventual output location was returned when the job was submitted, the calling application can immediately go to the target output file and find the data it needs. In this next-gen ETL use case, we will use this notification system to immediately fire a Hive job to begin transformation whenever a partition is added to the raw or staged data tables. This makes systems that depend on these transformations easier to build: they need not poll for data, nor hard-code file locations for the sources and sinks of data moving through the dataflow.
    Step Four: Fire off UDFs. Since HCatalog can notify interested parties of the completion of file I/O tasks, and since HCatalog stores file data underneath abstracted table and partition names and locations, invoking the core UDFs that transform the mainframe's data into standard SQL data types can be programmatic. In other words, when a partition is created and the data backing it is fully loaded into HDFS, a persistent Hive client can wake up, be notified of the new data, and grab that data to load into Teradata.
    Step Five: Invoke parallel transport (Q1 2013). Coming in the first quarter of 2013 or soon thereafter, Teradata and the Hortonworks Data Platform will communicate using Teradata's parallel transport mechanism. This will provide the same performance benefits as parallel ingest, but for the final step in the dataflow. For now, systems integrators and/or Teradata and Hortonworks team members can implement a few DFS clients to load chunks or segments of the table data into Teradata in parallel. (A sketch of the WebHDFS parallel-ingest step follows this note.)
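The parallel-ingest step above can be prototyped with plain WebHDFS calls. Below is a minimal Python sketch, not the Teradata connector: the NameNode address, user name, local staging paths and monthly partition layout are all assumptions for illustration; the two-step create (NameNode redirect, then direct write to a DataNode) is the standard WebHDFS REST flow.

```python
"""Minimal sketch of parallel ingest into a monthly partition via WebHDFS.
Host, port, paths and chunk layout are illustrative assumptions; the real
dataflow described above would use DistCp or a vendor connector."""
import concurrent.futures
import glob
import os

import requests

NAMENODE = "http://namenode.example.com:50070"   # assumed WebHDFS endpoint
HDFS_USER = "etl"                                # assumed HDFS user
PARTITION_DIR = "/data/staged/month=2013-05"     # assumed partition layout


def upload_chunk(local_path):
    """Two-step WebHDFS create: ask the NameNode where to write,
    then stream the chunk directly to the DataNode it redirects us to."""
    name = os.path.basename(local_path)
    url = (f"{NAMENODE}/webhdfs/v1{PARTITION_DIR}/{name}"
           f"?op=CREATE&overwrite=true&user.name={HDFS_USER}")
    # Step 1: the NameNode answers with a 307 redirect to a DataNode location.
    r = requests.put(url, allow_redirects=False)
    r.raise_for_status()
    datanode_url = r.headers["Location"]
    # Step 2: write the chunk to that DataNode; HDFS replicates it from there.
    with open(local_path, "rb") as f:
        requests.put(datanode_url, data=f).raise_for_status()
    return name


if __name__ == "__main__":
    chunks = sorted(glob.glob("/staging/2013-05/chunk-*.dat"))  # assumed local staging
    # Upload many chunks at once ("waves" of copies) instead of one by one.
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        for done in pool.map(upload_chunk, chunks):
            print("uploaded", done)
    # After the last chunk lands, notify HCatalog / downstream jobs (step three
    # above), for example by adding the partition through the HCatalog CLI.
```

The point of the sketch is only that asking the NameNode for write locations lets many chunks be copied in parallel waves; production ingest would use DistCp or the connectors named above.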
  • Example: high-tech surveys, customer sat and product sat. Surveys have multiple-choice and freeform sections. Input and analyze the plain-text sections. Join cross-channel support requests and device telemetry back to the customer. Another example: wireless carrier and "golden path".
  • Example: retail custom homepage. Clusters of related products. Set up models in HBase that influence when user behaviors trigger recommendations, or inform users of custom recommendations when they enter.
  • Community-developed frameworks: machine learning / analytics (MPI, GraphLab, Giraph, Hama, Spark, …); services inside Hadoop (memcache, HBase, Storm, …); low-latency computing (CEP or stream processing).
  • Buzz about low latency access in Hadoop
  • Attribution: http://www.flickr.com/photos/adavey/2919843490/sizes/o/in/photostream/
  • Hortonworks Sandbox. Hortonworks accelerates Hadoop skills development with an easy-to-use, flexible and extensible platform to learn, evaluate and use Apache Hadoop.
    What it is: a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform. Provides demos, videos and step-by-step hands-on tutorials, pre-built partner integrations and access to datasets.
    What it does: dramatically accelerates the process of learning Apache Hadoop. See It: demos and videos to illustrate use cases. Learn It: multi-level step-by-step tutorials. Do It: hands-on exercises for faster skills development.
    How it helps: accelerates and validates the use of Hadoop within your unique data architecture. Use your data to explore and investigate your use cases. Zero to big data in 15 minutes.
  • But beyond core Hadoop, Hortonworkers are also deeply involved in the ancillary projects that are necessary for more general usage. As you can see, in both code count and committers, we contribute more than any others to core Hadoop. And for the other key projects such as Pig, Hive, HCatalog and Ambari we are doing the same. This community leadership across both core Hadoop and the related open source projects is crucial in enabling us to play the critical role in turning Hadoop into Enterprise Hadoop.
  • So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique: we start with the group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute back all of the bug fixes to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP) that includes both Hadoop core and the related projects required by the enterprise user, and provide it to our customers. Through this application of enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.

Transcript

  • 1. © Hortonworks Inc. 2013. Apache Hadoop for Big Science: History, Use Cases & Futures. Eric Baldeschwieler, "Eric14", Hortonworks CTO, @jeric14
  • 2. © Hortonworks Inc. 2013. Agenda:
    – What is Apache Hadoop
    – Project motivation & history
    – Use cases
    – Futures and observations
  • 3. © Hortonworks Inc. 2013. What is Apache Hadoop?
  • 4. © Hortonworks Inc. 2013. Traditional data systems vs. Hadoop
    Traditional data systems:
    – Limited scaling options
    – Expensive at scale
    – Complex components
    – Proprietary software
    – Reliability in hardware
    – Optimized for latency, IOPS
    Hadoop cluster:
    – Low-cost scale-out
    – Commodity components
    – Open source software
    – Reliability in software
    – Optimized for throughput
    When your data infrastructure does not scale … Hadoop
  • 5. © Hortonworks Inc. 2013. Apache Hadoop: Big Data Platform. Open source data management with scale-out storage & distributed processing.
    Storage (HDFS):
    – Distributed across a cluster
    – Natively redundant, self-healing
    – Very high bandwidth
    Processing (MapReduce):
    – Splits a job into small tasks and moves compute "near" the data
    – Self-healing
    – Simple programming model
    Key characteristics:
    – Scalable: efficiently store and process petabytes of data; scale out linearly by adding nodes (node == commodity computer)
    – Reliable: data replicated 3x; failover across nodes and racks
    – Flexible: store all types of data in any format
    – Economical: commodity hardware; open source software (via ASF); no vendor lock-in
  • 6. © Hortonworks Inc. 2013 (from Richard McDougall, VMware, Hadoop Summit 2012 talk). Hadoop's cost advantage:
    – SAN storage: $2 - $10/gigabyte; $1M gets 0.5 petabytes, 1,000,000 IOPS, 1 GByte/sec
    – NAS filers: $1 - $5/gigabyte; $1M gets 1 petabyte, 400,000 IOPS, 2 GByte/sec
    – Local storage: $0.05/gigabyte; $1M gets 20 petabytes, 10,000,000 IOPS, 800 GBytes/sec
  • 7. © Hortonworks Inc. 2013. Hadoop hardware
    – 10 to 4500 node clusters: 1-4 "master nodes", interchangeable workers
    – Typical node: 4-12 x 2-4 TB SATA; 64 GB RAM; 2 x 4-8 core, ~2 GHz; 2 x 1 Gb NICs; single power supply; JBOD, not RAID, …
    – Switches: 1-2 Gb to the node, ~20 Gb to the core, full bisection bandwidth, layer 2 or 3, simple
  • 8. © Hortonworks Inc. 2013. Zooming out: an Apache Hadoop platform. The Hortonworks Data Platform (HDP) runs on appliances, cloud, or OS / VM and comprises:
    – Platform services: enterprise readiness (HA, DR, snapshots, security, …)
    – Hadoop core: distributed storage & processing (HDFS, MapReduce)
    – Data services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
    – Operational services: manage & operate at scale (Oozie, Ambari)
  • 9. © Hortonworks Inc. 2013. Zooming out: a Big Data architecture
    – Applications: business analytics, custom applications, packaged applications
    – Data systems: Hortonworks Data Platform alongside traditional repos (RDBMS, EDW, MPP)
    – Data sources: traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media), mobile data, OLTP / POS systems
    – Surrounding tooling: operational tools (manage & monitor), dev & data tools (build & test)
  • 10. © Hortonworks Inc. 2013. Motivation and history (adoption timeline chart, 2007-2010)
  • 11. © Hortonworks Inc. 2013. Eric Baldeschwieler - CTO Hortonworks
    – 2011 – now: Hortonworks - CTO
    – 2006 – 2011: Yahoo! - VP Engineering, Hadoop
    – 2003 – 2005: Yahoo! - Web Search Engineering; built systems that crawl & index the web
    – 1996 – 2003: Inktomi - Web Search Engineering; built systems that crawl & index the web
    – Previously: UC Berkeley - Masters CS; video game development; digital video & 3D rendering software; Carnegie Mellon - BS Math/CS
  • 12. © Hortonworks Inc. 2013. Early history
    – 1995 – 2005: Yahoo! search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
    – 2004: Google publishes Google File System & MapReduce papers
    – 2005: Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!; Yahoo! search commits to build open source DFS & MapReduce
      – Compete / differentiate via open source contribution!
      – Attract scientists; become a known center of big data excellence
      – Avoid building proprietary systems that will be obsolesced
      – Gain leverage of a wider community building one infrastructure
    – 2006: Hadoop is born! Dedicated team under E14 staffed at Yahoo!; Nutch prototype used to seed the new Apache Hadoop project
  • 13. © Hortonworks Inc. 2013. Hadoop at Yahoo! Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
  • 14. © Hortonworks Inc. 2013. Hortonworks – 100% open source. We develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution, endorsed by strategic partners.
    – We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform
    – We engineer, test & certify HDP for enterprise usage
    – We employ the core architects, builders and operators of Apache Hadoop
    – We drive innovation within Apache Software Foundation projects
    – We are uniquely positioned to deliver the highest quality of Hadoop support
    – We enable the ecosystem to work better with Hadoop
    Headquarters: Palo Alto, CA. Employees: 200+ and growing. Investors: Benchmark, Index, Yahoo.
  • 15. Case study: Yahoo Search Assist™ (© Yahoo 2011)
    – Database for Search Assist™ is built using Apache Hadoop
    – Several years of log data
    – 20 steps of MapReduce
    Before Hadoop vs. after Hadoop:
    – Time: 26 days vs. 20 minutes
    – Language: C++ vs. Python
    – Development time: 2-3 weeks vs. 2-3 days
  • 16. Apache Hadoop ecosystem history
    – 2006 – present: early adopters scale and productize Hadoop
    – 2008 – present: other internet companies add tools / frameworks, enhance Hadoop
    – 2010 – present: service providers provide training, support, hosting
    – 2011 – present: wide adoption funds further development, enhancements
    Companies shown: Cloudera, MapR, Microsoft, IBM, EMC, Oracle
  • 17. © Hortonworks Inc. 2013. Use cases
  • 18. © Hortonworks Inc. 2013
  • 19. © Hortonworks Inc. 2013. Use case: full genome sequencing
    – The data: 1 full genome = 1 TB (raw, uncompressed); 1M people sequenced = 1 exabyte; cost per person = $1000 and continues to drop
    – Uses for Hadoop:
      – Large-scale compute applications: map NGS data ("reads") to a reference genome; used for drug development, personalized treatment; community-developed Hadoop-based software for gene matching: Cloudburst, Crossbow
      – Store, manage and share genomics data in the bio-informatics community
    See: http://hortonworks.com/blog/big-data-in-genomics-and-cancer-treatment
  • 20. © Hortonworks Inc. 2013. Use case: oil & gas
    – Digital oil field: data sizes 2+ TB / day; application: safety/security, improve field performance; Hadoop used for data storage and analytics
    – Seismic image processing: drill ship costs $1M/day; one "shot" (in SEGY format) contains ~2.5 GB; Hadoop used to parallelize computation and store data post-processing
    – Previously data was discarded immediately after processing! Now kept for reprocessing and research & development
  • 21. © Hortonworks Inc. 2013. Use case: high-energy physics
    – Collecting events from colliders: "we have a very big digital camera"; each "event" = ~1 MB; looking for rare events (need millions of events for statistical significance)
    – Typical task: scan through events and look for particles with a certain mass; analyze millions of events in parallel; Hadoop used in streaming mode with C++ code to analyze events (see the streaming sketch below)
    – HDFS used for low-cost storage
    http://www.linuxjournal.com/content/the-large-hadron-collider
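The "scan events for particles in a mass window" task maps naturally onto Hadoop Streaming. The production analyzers were C++ per the slide; the sketch below is a hypothetical Python stand-in that assumes one tab-separated event summary per line with the reconstructed mass (in GeV) in the last column, and the mass window itself is an arbitrary example.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper: keep only collider events whose
reconstructed mass falls in a window of interest. Assumes one event summary
per input line, tab-separated, with the mass in the last column."""
import sys

MASS_LOW, MASS_HIGH = 120.0, 130.0  # assumed window, in GeV


def main():
    kept = 0
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        try:
            mass = float(fields[-1])
        except (ValueError, IndexError):
            continue  # skip malformed records rather than failing the task
        if MASS_LOW <= mass <= MASS_HIGH:
            kept += 1
            # Emit key<TAB>value; a reducer (or `sort | uniq -c`) sums the 1s.
            sys.stdout.write("in_window\t1\n")
    # Hadoop Streaming counter protocol: reporter:counter:<group>,<name>,<amount>
    sys.stderr.write(f"reporter:counter:physics,events_kept,{kept}\n")


if __name__ == "__main__":
    main()
```

Paired with a reducer that sums the emitted counts, this would be submitted through the hadoop-streaming jar; locally it can be tested with `cat events.tsv | ./mass_filter.py | sort | uniq -c`.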
  • 22. © Hortonworks Inc. 2013. Use case: National Climate Assessment
    – "Rapid, Flexible, and Open Source Big Data Technologies for the U.S. National Climate Assessment", Chris A. Mattmann, Senior Computer Scientist, NASA JPL. Chris and team have done a number of projects with Hadoop.
    – Goal: compare regional climate models to a variety of satellite observations. Traditionally models are compared to other models, not to actual observations. Normalize complex multi-format data to lat/long + observation values.
    – Hadoop: used Apache Hive to provide a scale-out SQL warehouse of the data. See the paper, or the case study in "Programming Hive" (O'Reilly, 2012).
    Credit: Kathy Jacobs
  • 23. © Hortonworks Inc. 2013. Apache Hadoop: patterns of use. Big Data = transactions + interactions + observations. Refine, Explore, Enrich.
  • 24. © Hortonworks Inc. 2013. Operational data refinery: Hadoop as platform for ETL modernization (Refine)
    – Capture: capture new unstructured data along with log files, all alongside existing sources; retain inputs in raw form for audit and continuity purposes
    – Process: parse the data & cleanse; apply structure and definition; join datasets together across disparate data sources
    – Exchange: push to the existing enterprise data warehouse for downstream consumption; feeds operational reporting and online systems
    (Diagram: unstructured data, log files and DB data are captured and archived, parsed & cleansed, structured and joined in the refinery, then uploaded to the enterprise data warehouse)
  • 25. © Hortonworks Inc. 2013. Big data exploration: Hadoop as agile, ad-hoc data mart (Explore)
    – Capture: capture multi-structured data and retain inputs in raw form for iterative analysis
    – Process: parse the data into queryable format; explore & analyze using Hive, Pig, Mahout and other tools to discover value; label data and type information for compatibility and later discovery; pre-compute stats, groupings and patterns in the data to accelerate analysis
    – Exchange: use visualization tools to facilitate exploration and find key insights; optionally move actionable insights into the EDW or datamart
    (Diagram: unstructured data, log files and DB data are captured and archived, categorized into tables, structured and joined, then exposed to visualization tools via JDBC / ODBC, with an optional upload to the EDW / datamart)
  • 26. 31-Mar-2013 NCAR-SEA-2013 26
  • 27. © Hortonworks Inc. 2013. Application enrichment: deliver Hadoop analysis to online apps (Enrich)
    – Capture: capture data that was once too bulky and unmanageable
    – Process: uncover aggregate characteristics across data; use Hive, Pig and MapReduce to identify patterns; filter useful data from mass streams (Pig); micro or macro batch-oriented schedules
    – Exchange: push results to HBase or another NoSQL alternative for real-time delivery (see the HBase sketch below); use patterns to deliver the right content/offer to the right person at the right time
    (Diagram: unstructured data, log files and DB data are captured, parsed, derived/filtered on scheduled & near-real-time cycles, and served from NoSQL / HBase at low latency to online applications)
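As a concrete illustration of the "push results to HBase for real-time delivery" exchange step, here is a hedged Python sketch using the happybase Thrift client; the table name, column family, row-key scheme and sample output are assumptions, not part of the deck.

```python
"""Minimal sketch: publish batch-computed recommendations into HBase so an
online application can read them with low latency. Table and column names,
and the use of the happybase Thrift client, are illustrative assumptions."""
import happybase

# Assumed: an HBase Thrift server at this host and a table created beforehand,
# e.g. in the hbase shell: create 'recommendations', 'rec'
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("recommendations")

# Pretend these came out of a nightly Hive / Pig / MapReduce job.
batch_output = {
    "user42": ["sku-1001", "sku-2040", "sku-0007"],
    "user43": ["sku-5550", "sku-1001"],
}

with table.batch() as batch:
    for user_id, skus in batch_output.items():
        # One row per user; each recommended item goes in its own column.
        batch.put(
            user_id.encode(),
            {f"rec:item{i}".encode(): sku.encode() for i, sku in enumerate(skus)},
        )

# The online app then serves a user's offers with a single point read:
row = table.row(b"user42")
print(sorted(v.decode() for v in row.values()))
connection.close()
```

Keying rows by user id keeps the online lookup to a single get, which is what makes the serving path low latency.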
  • 28. Case study: Yahoo! homepage (© Yahoo 2011). Personalized for each visitor (recommended links, news interests, top searches).
    Result: twice the engagement
    – +160% clicks vs. one-size-fits-all
    – +79% clicks vs. randomly selected
    – +43% clicks vs. editor selected
  • 29. Case study: Yahoo! homepage (© Yahoo 2011). Build customized home pages with the latest data (thousands / second).
    – Science Hadoop cluster: machine learning over user behavior builds ever-better categorization models (weekly)
    – Production Hadoop cluster: identifies user interests using the categorization models, producing serving maps (users - interests) every five minutes
    – Serving systems: use the serving maps to deliver personalized pages to engaged users, whose behavior feeds back into the clusters
  • 30. © Hortonworks Inc. 2013. Futures & observations
  • 31. © Hortonworks Inc. 2013. Hadoop 2.0 innovations - YARN
    – Focus on scale and innovation: support 10,000+ computer clusters; extensible to encourage innovation
    – Next-generation execution: improves MapReduce performance
    – Supports new frameworks beyond MapReduce: do more with a single Hadoop cluster; low latency, streaming, services; science (MPI, Spark, Giraph)
    (Diagram: MapReduce running on HDFS, redundant reliable storage)
  • 32. © Hortonworks Inc. 2013. Hadoop 2.0 innovations - YARN
    – Focus on scale and innovation: support 10,000+ computer clusters; extensible to encourage innovation
    – Next-generation execution: improves MapReduce performance
    – Supports new frameworks beyond MapReduce: do more with a single Hadoop cluster; low latency, streaming, services; science (MPI, Spark, Giraph)
    (Diagram: YARN cluster resource management layered on HDFS redundant, reliable storage, running MapReduce, Tez, streaming and other frameworks)
  • 33. © Hortonworks Inc. 2013. Stinger Initiative
    – Community initiative around Hive
    – Enables Hive to support interactive workloads
    – Improves existing tools & preserves investments
    – Query planner + Hive execution engine (Tez) + file format (ORC file) = 100x (see the sketch below)
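To make the ORC piece concrete, here is a hedged sketch of rewriting a raw staging table into the ORC file format by shelling out to the Hive CLI from Python; the table and column names are invented, and a real deployment might go through HiveServer2 instead.

```python
"""Hypothetical sketch: convert a raw text-format staging table into the ORC
columnar format that the Stinger initiative builds on, via the Hive CLI.
Table and column names are invented for illustration."""
import subprocess

HIVEQL = """
CREATE TABLE IF NOT EXISTS page_views_orc (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
STORED AS ORC;

-- Rewrite the raw data into columnar ORC for faster interactive scans.
INSERT OVERWRITE TABLE page_views_orc
SELECT user_id, url, ts FROM page_views_raw;
"""

# 'hive -e' runs a HiveQL string in batch mode.
subprocess.run(["hive", "-e", HIVEQL], check=True)
```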
  • 34. © Hortonworks Inc. 2013. Data lake projects
    – Keep raw data: 20+ PB projects; previously discarded
    – Unify many data sources: pull from all over the organization
    – Produce derived views: automatic "ETL" for regular downstream use cases; new applications from unified data
    – Support ad hoc exploration: prototype new use cases; answer unanticipated questions; agile rebuild from raw data
    (Diagram: landing zone (NFS, JMS) feeding ingest, staging, archive, core-general and core-secure areas; data flow described in descriptor docs)
  • 35. © Hortonworks Inc. 2013. Interesting things on the horizon
    – Solid-state storage and disk drive evolution: so far LFF drives seem to be maintaining their economic advantage (4 TB drives now & 7 TB next year!); SSDs are becoming ubiquitous and will become part of the architecture
    – In-RAM databases: bring them on, let's port them to YARN! Hadoop complements these technologies and shines with huge data
    – Atom / ARM processors: this is great for Hadoop! But… vendors are not yet designing the right machines (bandwidth to disk)
    – Software-defined networks: this is great for Hadoop, more network for less!
  • 36. © Hortonworks Inc. 2013. Thank you! Eric Baldeschwieler, CTO Hortonworks, Twitter @jeric14. Get involved! (Apache Foundation and new users exchange contributions & validation)
  • 37. © Hortonworks Inc. 2013. See Hadoop > Learn Hadoop > Do Hadoop: a full environment to evaluate Hadoop; hands-on step-by-step tutorials to learn
  • 38. © Hortonworks Inc. 2013. STOP! Bonus material follows
  • 39. © Hortonworks Inc. 2013. Hortonworks approach: community-driven Enterprise Apache Hadoop
    – Identify and introduce enterprise requirements into the public domain
    – Work with the community to advance and incubate open source projects
    – Apply enterprise rigor to provide the most stable and reliable distribution
  • 40. © Hortonworks Inc. 2013. Driving Enterprise Hadoop innovation
    (Chart: lines of code by company, comparing Hortonworks, Yahoo!, Cloudera and others across Hadoop Core, Pig, Hive, HCatalog, HBase and Ambari, plus Hortonworks vs. Cloudera committer counts per project. Source: Apache Software Foundation)
  • 41. © Hortonworks Inc. 2013. Hortonworks process for Enterprise Hadoop
    – Upstream community projects (Apache Hadoop, Pig, Hive, HCatalog, HBase, Ambari, other Apache projects): design & develop, test & patch, release
    – Downstream enterprise product (Hortonworks Data Platform): integrate & test, package & certify, distribute
    – No lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
    – Virtuous cycle when development & fixed issues are done upstream and stable project releases flow downstream
  • 42. © Hortonworks Inc. 2013. Hadoop and cloud
    – Can I run Hadoop on OpenStack or in my virtualization infrastructure? Yes, but… it depends on your use case and hardware choices. We will see a lot of innovation in this space in coming years. OpenStack Savanna: a collaboration to bring Hadoop to OpenStack.
    – Zero-procurement POC: try Hadoop in the cloud. 5-10 nodes works great (on private or public cloud); many projects are done today in public clouds
    – Occasional use (run Hadoop when the cluster is not busy): where do you store the data when Hadoop is not running? At >20 nodes, review your network and storage design
    – Large-scale, continuous deployment (100 – 4000 nodes): need to design your storage and network for Hadoop
  • 43. © Hortonworks Inc. 2013. Open source in the architecture
    – Applications: BI (Jaspersoft, Pentaho, …), NoSQL in apps (HBase, Cassandra, MongoDB, …), search apps (ElasticSearch, Solr, …)
    – Data systems: Hortonworks Data Platform, DBs (Postgres, MySQL), search (ElasticSearch, Solr, …)
    – Data sources: DBs, search, repos, ESB / ETL (ActiveMQ, Talend, Kettle)
    – Dev & data tools: Eclipse, OpenJDK, Spring, VirtualBox, …
    – Operational tools: Nagios, Ganglia, Chef, Puppet, …
  • 44. Case study: Yahoo! WebMap (© Yahoo 2011)
    What is a WebMap?
    – Gigantic table of information about every web site, page and link Yahoo! knows about
    – Directed graph of the web
    – Various aggregated views (sites, domains, etc.)
    – Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
    Why was it ported to Hadoop?
    – Custom C++ solution was not scaling
    – Leverage scalability, load balancing and resilience of Hadoop infrastructure
    – Focus on application vs. infrastructure
  • 45. Case study: WebMap project results (© Yahoo 2011)
    – 33% time savings over the previous system on the same cluster (and Hadoop keeps getting better)
    – Was the largest Hadoop application, drove scale: over 10,000 cores in the system; 100,000+ maps, ~10,000 reduces; ~70 hours runtime; ~300 TB shuffling; ~200 TB compressed output
    – Moving data to Hadoop increased the number of groups using the data
  • 46. © Hortonworks Inc. 2013. Use case: computational advertising
    – A principled way to find "best match" ads, in context, for a query (or page view)
    – Lots of data: search: billions of unique queries per hour; display: trillions of ads displayed per hour; billions of users; billions of ads
    – Big business: $132B total advertising market (2015); $600B total worldwide market (2015)
    – Challenges: a huge number of small transactions; cost of serving < revenue per search
  • 47. Example: predicting CTR (search ads)
    – Rank = bid * CTR
    – Predict CTR for each ad to determine placement, based on: historical CTR, keyword match, etc.
    – Approach: supervised learning (see the sketch below)
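A toy illustration of the supervised-learning approach: fit a click model on historical features, predict CTR for candidate ads, and order them by bid * CTR. The features, the tiny fabricated data and the use of scikit-learn's logistic regression are assumptions for illustration only, not Yahoo!'s production pipeline.

```python
"""Toy sketch of search-ad ranking: Rank = bid * predicted CTR.
Features, data and the scikit-learn model are illustrative assumptions."""
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical training data: [historical_ctr, keyword_match_score] per impression,
# label = 1 if the ad was clicked, 0 otherwise (tiny fabricated example).
X_train = np.array([[0.02, 0.9], [0.01, 0.2], [0.05, 0.8], [0.00, 0.1],
                    [0.04, 0.7], [0.01, 0.3], [0.03, 0.6], [0.00, 0.2]])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

# Candidate ads for one query: (ad_id, bid in $, feature vector).
candidates = [("ad_a", 0.50, [0.03, 0.8]),
              ("ad_b", 1.20, [0.01, 0.4]),
              ("ad_c", 0.80, [0.04, 0.6])]

features = np.array([f for _, _, f in candidates])
ctr = model.predict_proba(features)[:, 1]          # P(click | features)
ranks = [(ad_id, bid * p) for (ad_id, bid, _), p in zip(candidates, ctr)]

# Highest bid * CTR wins the top placement.
for ad_id, score in sorted(ranks, key=lambda r: r[1], reverse=True):
    print(f"{ad_id}: rank score = {score:.4f}")
```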
  • 48. © Hortonworks Inc. 2013. Hadoop for advertising science @ Yahoo!
    – Advertising science moved CTR prediction from "legacy" (MyNA) systems to Hadoop: scientist productivity dramatically improved; platform for massive A/B testing of computational advertising algorithmic improvements
    – Hadoop enabled the next-gen contextual advertising matching platform: a heavy compute process that is highly parallelizable
  • 49. MapReduce
    – MapReduce is a distributed computing programming model
    – It works like a Unix pipeline: cat input | grep | sort | uniq -c > output; Input | Map | Shuffle & Sort | Reduce | Output (see the word-count sketch below)
    – Strengths: easy to use! Developer just writes a couple of functions; moves compute to data (schedules work on the HDFS node with the data if possible); scans through data, reducing seeks; automatic reliability and re-execution on failure
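To ground the pipeline analogy, here is a small word-count sketch in the MapReduce shape (Input | Map | Shuffle & Sort | Reduce | Output). It runs locally on stdin to mimic the Unix pipeline; with Hadoop Streaming the map and reduce functions would live in separate scripts wired together by the framework, and the file names below are assumptions.

```python
"""Word count in the shape of the slide's pipeline:
Input | Map | Shuffle & Sort | Reduce | Output.
Runs locally on stdin to mimic `cat input | grep | sort | uniq -c`."""
import sys
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Reduce: sum the counts for each word, like `uniq -c` after `sort`."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    mapped = map_phase(sys.stdin)
    shuffled = sorted(mapped, key=itemgetter(0))   # stands in for shuffle & sort
    for word, total in reduce_phase(shuffled):
        print(f"{word}\t{total}")
```

Try it with `echo "to be or not to be" | python wordcount.py`, which prints each word with its count (be 2, to 2, and so on).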
  • 50. © Hortonworks Inc. 2013. HDFS in action
    – The client puts big data into HDFS (via RPC or REST), talking to the NameNode
    – The data is broken into chunks and distributed to the DataNodes (DataNode 1, DataNode 2, DataNode 3)
    – The DataNodes replicate the chunks