Hortonworks Big Data & Hadoop


Presenter: Ofer Mendelevitch of Hortonworks. Learn the benefits of big data for data scientists, and how Hadoop and HDInsight fit into the modern data architecture and enable data-driven products.

You'll learn:

* What data science actually means
* What the term "data products" means
* The benefits of using big data for data scientists
* How Hadoop helps data scientists work with big data
* About HDInsight, the big data platform from Microsoft and Hortonworks

Published in: Business, Technology

  1. 1. © Hortonworks Inc. 2013 • Big Data, Data Science & Hadoop • Ofer Mendelevitch • San Francisco Bay Area Microsoft Business Intelligence User Group • May 2013
  2. 2. Who am I? Director of Data Sciences @ Hortonworks • Data science with Hadoop • Professional services • Previously… • A Chess Dad
  4. 4. Gartner’s 3 V’s of big data: • Volume: size of the data • Velocity: ingest speed, response latency • Variety: diverse sources; format, structure; data quality
  5. 5. What Makes Up Big Data? Transactions + Interactions + Observations = BIG DATA • Scale: megabytes, gigabytes, terabytes, petabytes, with increasing data variety and complexity • ERP: purchase detail, purchase record, payment record • CRM: offer details, support contacts, customer touches, segmentation • WEB: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels • BIG DATA: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
  6. 6. Where is all this data coming from? • Sensors/devices • Online: social, forums, etc. • Event logs • Etc. But also: • Data that was “thrown away” previously
  7. 7. Ok… so what is big data? I like a quote from Michael Franklin (UCB): “Big Data is any data that is expensive to manage and hard to extract value from.” It’s a relative term. Today’s big data may be tomorrow’s small data.
  9. 9. What is a data product? “A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
  10. 10. Example 1: Google AdWords
  11. 11. Example 2: People you may know
  12. 12. Example 3: Spell correction
  14. 14. What is data science? #1: Extracting deep meaning from data (data mining; finding “gems” in data)
  15. 15. What is data science? #2: Building data products (delivering gems on a regular basis) • Pre-process • Build model • SQL • Periodic batch processing • Online serving
  16. 16. Common data science tasks • Descriptive – Clustering: detect natural groupings – Outlier detection: detect anomalies – Affinity analysis: co-occurrence patterns • Predictive – Classification: predict a category – Regression: predict a value – Recommendation: predict a preference
  18. 18. A brief history of Apache Hadoop • Timeline: 2004, 2006, 2008, 2010, 2012, 2013 • Milestones: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform • Focus on INNOVATION, 2005: Yahoo! creates team under E14 to work on Hadoop • Focus on OPERATIONS, 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters • STABILITY, 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
  19. 19. HDP: Enterprise-Ready Hadoop • Hortonworks Data Platform (HDP) • Hadoop Core (distributed storage & processing): HDFS, MapReduce • Platform Services (enterprise readiness): HA, DR, snapshots, security, … • Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume • Operational Services (manage & operate at scale): Oozie, Ambari • OS / VM • Cloud • Appliance
  20. 20. Core Hadoop: HDFS & MapReduce • Deliver high-scale storage & processing • HDFS: distributed, self-healing data store • MapReduce: distributed computation framework that handles the complexities of distributed programming
  21. 21. Keys to Hadoop’s power • Computation co-located with data – Data and computation system co-designed and co-developed to work together • Process data in parallel across thousands of “commodity” hardware nodes – Self-healing; failure handled by software • Designed for one write and multiple reads – There are no random writes – Optimized for minimum seek on hard drives
  22. 22. Inside HDP for Windows • Hortonworks Data Platform (HDP) for Windows • 100% open source enterprise Hadoop • Component and version compatible with Microsoft HDInsight • Availability • Beta release available now • GA early 2Q 2012 • Platform Services • Hadoop Core (distributed storage & processing): HDFS, WebHDFS, MapReduce • Data Services (store, process and access data): HCatalog, Hive, Pig, Sqoop • Operational Services (manage & operate at scale): Oozie
  23. 23. Seamless Interoperability with Your Microsoft Tools • Integrated with Microsoft tools for native big data analysis – Bi-directional connectors for SQL Server and SQL Azure through Sqoop – Excel ODBC integration through Hive • Addressing demand for Hadoop on Windows – Ideal for Windows customers with Hadoop operational experience • Enables all common Hadoop workloads – Data refinement and ETL offload for high-volume data landing – Data exploration for discovery of new business opportunities • Applications: Microsoft applications • Data systems: Hortonworks Data Platform for Windows • Data sources: mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media)
  25. 25. Data Science, now with more data…
  26. 26. Benefits of Hadoop for data science • Benefit #1: Explore full datasets
  27. 27. Explore large datasets directly with Hadoop • Cycle: Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate • Full dataset stored on Hadoop • Researcher laptop: R, Matlab, SAS, etc.
  28. 28. Integrate Hadoop in your data analysis flow • Full dataset resides in Hadoop • Typical Hadoop tasks: – Simple statistics: mean, median, correlation – Text pre-processing: grep, regex, NLP – Dimensionality reduction: PCA, SVD, clustering, etc. – Random sampling: with or without replacement, by unique – K-fold cross-validation
  29. 29. Benefits of Hadoop for data science • Benefit #2: Mine larger datasets
  30. 30. More data -> better outcomes • Banko & Brill, 2001 • Halevy, Norvig & Pereira, 2009
  31. 31. Learning algorithms with large datasets… Challenges: • Data won’t fit in memory • Learning takes a lot longer… Using Hadoop: • Distribute data across nodes in the Hadoop cluster • Implement a distributed/parallel algorithm
  32. 32. Benefits of Hadoop for data science • Benefit #3: Large-scale data preparation
  33. 33. 80% of data science work is data preparation • From raw data to processed data: strip away HTML/PDF/DOC/PPT; entity resolution; document vector generation; sampling, filtering; joins; term normalization
  34. 34. Hadoop is ideal for batch data preparation and cleanup of large datasets
  35. 35. Benefits of Hadoop for data science • Benefit #4: Accelerate data-driven innovation
  36. 36. Barriers to speed with traditional data architectures • RDBMS uses “schema on write”; change is expensive • High barrier for data-driven innovation • Timeline (Start, 6 months, 9 months; schema change project): “I need new data” • “Finally, we start collecting” • “Let me see… is it any good?”
  37. 37. “Schema on read” means faster time-to-innovation • Hadoop uses “schema on read” • Low barrier for data-driven innovation • Timeline (Start, 3 months, 6 months): “I need new data” • “Let’s just put it in a folder on HDFS” • “Let me see… is it any good?” • “My model is awesome!”
  38. 38. Quick start: Hortonworks Sandbox • What is it – A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform – A personal Hadoop environment – An integrated learning environment with frequently and easily updatable, hands-on, step-by-step tutorials • What it does – Dramatically accelerates the process of learning Apache Hadoop – Accelerates and validates the use of Hadoop within your unique data architecture – Lets you use your data to explore and investigate your use cases • ZERO to big data in 15 minutes • Download Hortonworks up for Training for in-depth
  39. 39. Hadoop Summit: Architecting the Future of Big Data • June 26-27, 2013, San Jose Convention Center • Co-hosted by Hortonworks & Yahoo! • Theme: Enabling the Next Generation Enterprise Data Platform • 90+ sessions and 7 tracks • Community-focused event – Sessions selected by a conference committee – Community Choice allowed the public to vote for sessions they want to see • Pre-event training classes – Apache Hadoop Essentials: A Technical Understanding for Business Users – Understanding Microsoft HDInsight and Apache Hadoop – Developing Solutions with Apache Hadoop: HDFS and MapReduce – Applying Data Science using Apache Hadoop • 10% discount code:
  40. 40. Thank you! Any Questions? • Ofer Mendelevitch, Director, Data Sciences @, @hortonworks • We’re hiring!
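
Slide 20 describes MapReduce as a distributed computation framework that hides the complexities of distributed programming. As a rough illustration of the programming model only, the sketch below runs the map, shuffle, and reduce phases of a word count in a single Python process. It does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all values by key, which a real framework
    performs across the cluster between the map and reduce phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all counts for one word into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce functions would run on many nodes next to the data they process; only the two small functions are job-specific, which is the point the slide makes about the framework handling the distributed parts.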
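
Slide 28 lists random sampling and k-fold cross-validation among the typical tasks run against a full dataset. A minimal sketch of both, assuming records arrive as a stream too large to hold in memory: single-pass reservoir sampling (without replacement) and a hash-based fold assignment. This is plain Python, not Hadoop code, and `reservoir_sample` and `fold_of` are hypothetical names chosen for the example.

```python
import random
import zlib
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw k records uniformly without replacement in one pass,
    keeping only k records in memory (reservoir sampling, Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = record
    return sample


def fold_of(record_id: str, n_folds: int) -> int:
    """Assign a record to a cross-validation fold by hashing its id,
    so the assignment is stable across runs and across machines."""
    return zlib.crc32(record_id.encode("utf-8")) % n_folds
```

Hashing the record id rather than drawing a random number means every node computes the same fold for the same record without any coordination, which is one common way to split data for k-fold validation in a parallel job.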
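
Slides 36 and 37 contrast "schema on write" with "schema on read". The sketch below illustrates the idea under simple assumptions: raw records are stored untouched as JSON lines, and a schema (here a hypothetical mapping from field name to a casting function) is applied only when the data is read. The helper `read_with_schema` and the sample fields are illustrative, not part of any Hadoop component.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# A schema maps each wanted field to a function that casts the raw value.
Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(raw_lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Parse raw JSON lines and project them onto the schema at read time.
    Fields missing from a record become None; extra fields are ignored."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the raw file
        record = json.loads(line)
        row: Dict[str, Any] = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# The raw data never changes; only the schema used to read it does.
raw = [
    '{"user": "a", "price": "9.5", "extra": 1}',
    '{"user": "b"}',
]
schema_v1: Schema = {"user": str}
schema_v2: Schema = {"user": str, "price": float}

rows_v1 = list(read_with_schema(raw, schema_v1))
rows_v2 = list(read_with_schema(raw, schema_v2))
```

Adding the `price` field in `schema_v2` required no change to the stored data, which is the low barrier to new analysis that slide 37 describes.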