Hortonworks Big Data & Hadoop

Presenter: Ofer Mendelevitch of Hortonworks > Learn the benefits of big data for data scientists, and how Hadoop and HDInsight fit into the modern data architecture and enable data-driven products.

You'll learn:

* What data science actually means
* The term "data products"
* The benefits of using big data for data scientists
* How Hadoop helps data scientists work with big data
* About HDInsight, the big data platform from Microsoft and Hortonworks

Published in: Business, Technology

  • 1. Big Data, Data Science & Hadoop. Ofer Mendelevitch. San Francisco Bay Area Microsoft Business Intelligence User Group, May 2013. © Hortonworks Inc. 2013
  • 2. Who am I?
      • Director of Data Sciences @ Hortonworks
          – Data science with Hadoop
          – Professional services
      • Previously…
      • A Chess Dad
  • 4. Gartner’s 3 V’s of big data:
      • Volume: size of the data
      • Velocity: ingest speed, response latency
      • Variety: diverse sources; format, structure; data quality
  • 5. What Makes Up Big Data? (Diagram: data volume grows from megabytes through gigabytes and terabytes to petabytes as data variety and complexity increase.)
      • ERP: purchase detail, purchase record, payment record
      • CRM: offer details, support contacts, customer touches, segmentation
      • Web: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels
      • Big data: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video/audio/images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
      • Transactions + Interactions + Observations = BIG DATA
  • 6. Where is all this data coming from?
      • Sensors/devices
      • Online: social, forums, etc.
      • Event logs
      • Etc.
      • But also: data that was “thrown away” previously
  • 7. OK… so what is big data? I like a quote from Michael Franklin (UCB): “Big Data is any data that is expensive to manage and hard to extract value from.” It’s a relative term: today’s big data may be tomorrow’s small data.
  • 9. What is a data product? “A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
  • 10. Example 1: Google AdWords
  • 11. Example 2: People you may know
  • 12. Example 3: Spell correction
  • 14. What is data science? #1: Extracting deep meaning from data (data mining; finding “gems” in data)
  • 15. What is data science? #2: Building data products (delivering gems on a regular basis). Diagram labels: pre-process, build model, SQL; periodic batch processing; online serving
  • 16. Common data science tasks
      • Descriptive
          – Clustering: detect natural groupings
          – Outlier detection: detect anomalies
          – Affinity analysis: co-occurrence patterns
      • Predictive
          – Classification: predict a category
          – Regression: predict a value
          – Recommendation: predict a preference
  • 18. A brief history of Apache Hadoop (timeline axis: 2004, 2006, 2008, 2010, 2012, 2013)
      • Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop
      • Focus on OPERATIONS. 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters
      • STABILITY. 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo!
      • Milestones marked on the timeline: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
  • 19. HDP: Enterprise-Ready Hadoop. Hortonworks Data Platform (HDP) stack:
      • Hadoop core (distributed storage & processing): HDFS, MapReduce
      • Platform services (enterprise readiness): HA, DR, snapshots, security, …
      • Data services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
      • Operational services (manage & operate at scale): Oozie, Ambari
      • Runs on: OS / VM, cloud, appliance
  • 20. Core Hadoop: HDFS & MapReduce. Deliver high-scale storage & processing:
      • HDFS: distributed, self-healing data store
      • MapReduce: distributed computation framework that handles the complexities of distributed programming
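The map and reduce steps named on slide 20 can be sketched in plain Python. This is only an in-process illustration of the programming model and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the framework would perform the shuffle and spread the work across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper (hypothetical name): emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key; a real framework does this between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer (hypothetical name): sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in a single process over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

Because each mapper call sees only one line and each reducer call sees only one key, both phases can in principle run in parallel on different machines, which is the part of distributed programming the framework takes over.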
  • 21. Keys to Hadoop’s power
      • Computation co-located with data
          – Data and computation system co-designed and co-developed to work together
      • Process data in parallel across thousands of “commodity” hardware nodes
          – Self-healing; failure handled by software
      • Designed for one write and multiple reads
          – There are no random writes
          – Optimized for minimum seek on hard drives
  • 22. Inside HDP for Windows. Hortonworks Data Platform (HDP) for Windows:
      • 100% open source Enterprise Hadoop
      • Component and version compatible with Microsoft HDInsight
      • Availability
          – Beta release available now
          – GA early 2Q 2012
      • Stack diagram:
          – Hadoop core (distributed storage & processing): HDFS, WebHDFS, MapReduce
          – Data services (store, process and access data): HCatalog, Hive, Pig, Sqoop
          – Operational services (manage & operate at scale): Oozie
          – Platform services
  • 23. Seamless Interoperability with Your Microsoft Tools
      • Integrated with Microsoft tools for native big data analysis
          – Bi-directional connectors for SQL Server and SQL Azure through Sqoop
          – Excel ODBC integration through Hive
      • Addressing demand for Hadoop on Windows
          – Ideal for Windows customers with Hadoop operational experience
      • Enables all common Hadoop workloads
          – Data refinement and ETL offload for high-volume data landing
          – Data exploration for discovery of new business opportunities
      • Diagram layers: applications (Microsoft applications); data systems (Hortonworks Data Platform for Windows); data sources (mobile data; OLTP, POS systems; traditional sources: RDBMS, OLTP, OLAP; new sources: web logs, email, sensor data, social media)
  • 25. Data Science, now with more data…
  • 26. Benefits of Hadoop for data science. Benefit #1: Explore full datasets
  • 27. Explore large datasets directly with Hadoop. Diagram: an analysis cycle (acquire; clean data; visualize, grok; model; measure/evaluate) with the full dataset stored on Hadoop and R, Matlab, SAS, etc. on the researcher's laptop
  • 28. Integrate Hadoop in your data analysis flow
      • Full dataset resides in Hadoop
      • Typical Hadoop tasks:
          – Simple statistics: mean, median, correlation
          – Text pre-processing: grep, regex, NLP
          – Dimensionality reduction: PCA, SVD, clustering, etc.
          – Random sampling: with or without replacement, by unique
          – K-fold cross-validation
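Three of the tasks listed on slide 28 (a mean, random sampling without replacement, and k-fold assignment) can be written so that they need only one pass over data that does not fit in memory. The sketch below is plain Python under that assumption, not Hadoop code; the helper names are hypothetical, and reservoir sampling and hash-based fold assignment are standard techniques chosen here for illustration rather than anything the slides prescribe.

```python
import random
import zlib


def partial_stats(values):
    """Per-partition step (hypothetical name): return (count, sum) for one chunk."""
    values = list(values)
    return len(values), sum(values)


def combine_mean(partials):
    """Combine step: merge (count, sum) pairs from all partitions into one mean."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(chunk_sum for _, chunk_sum in partials)
    return total_sum / total_count


def reservoir_sample(stream, k, seed=0):
    """Uniform sample of up to k items without replacement, in a single pass."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = item
    return sample


def fold_of(record_id, n_folds):
    """Assign a record to a cross-validation fold by hashing its ID."""
    return zlib.crc32(str(record_id).encode("utf-8")) % n_folds
```

The mean is split into a per-partition step and a combine step because (count, sum) pairs can be merged in any order, which is what lets the work be spread over many nodes. Hashing the record ID keeps the fold assignment stable across runs without storing it.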
  • 29. Benefits of Hadoop for data science. Benefit #2: Mine larger datasets
  • 30. More data -> better outcomes (Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009)
  • 31. Learning algorithms with large datasets…
      • Challenges:
          – Data won’t fit in memory
          – Learning takes a lot longer…
      • Using Hadoop:
          – Distribute data across nodes in the Hadoop cluster
          – Implement a distributed/parallel algorithm
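One way to read "implement a distributed/parallel algorithm" on slide 31 is batch gradient descent in which every partition computes a partial gradient and a driver sums the results. The sketch below fits a one-variable linear model that way in plain Python; the partitions are ordinary lists standing in for data blocks on different nodes, and the function names, learning rate and iteration count are illustrative assumptions, not something taken from the talk.

```python
def partial_gradient(partition, w, b):
    """Per-node step: gradient sums of half the squared error over one partition."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)


def train(partitions, learning_rate=0.05, iterations=2000):
    """Driver: sum the partial gradients from all partitions, then update w and b."""
    w = b = 0.0
    for _ in range(iterations):
        partials = [partial_gradient(part, w, b) for part in partitions]
        n = sum(count for _, _, count in partials)
        grad_w = sum(gw for gw, _, _ in partials) / n
        grad_b = sum(gb for _, gb, _ in partials) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1, split into three "nodes".
    partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9), (5, 11)]]
    print(train(partitions))
```

Only the small tuple of partial sums leaves each partition, so no single machine ever has to hold the full dataset in memory.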
  • 32. Benefits of Hadoop for data science. Benefit #3: Large-scale data preparation
  • 33. 80% of data science work is data preparation. Diagram: raw data becomes processed data through steps such as stripping away HTML/PDF/DOC/PPT, entity resolution, document vector generation, term normalization, sampling, filtering, and joins
  • 34. Hadoop is ideal for batch data preparation and cleanup of large datasets
  • 35. Benefits of Hadoop for data science. Benefit #4: Accelerate data-driven innovation
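Three of the preparation steps named on slide 33 (stripping HTML, term normalization, and document vector generation) are small enough to sketch for a single document. The version below uses only the Python standard library; the regex-based tag stripping is a deliberately crude assumption standing in for a real HTML parser, and the function names are hypothetical. In a batch job the same function would be applied to every document independently.

```python
import re
from collections import Counter
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")      # crude tag matcher, for illustration only
TOKEN_RE = re.compile(r"[a-z0-9]+")  # runs of lowercase letters and digits


def strip_html(raw):
    """Replace tags with spaces, then decode entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", raw))


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def document_vector(raw):
    """Turn one raw document into a sparse term-frequency vector."""
    return Counter(normalize_terms(strip_html(raw)))


if __name__ == "__main__":
    print(document_vector("<p>Big Data &amp; <b>big</b> clusters</p>"))
```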
  • 36. Barriers to speed with traditional data architectures
      • RDBMS uses “schema on write”; change is expensive
      • High barrier for data-driven innovation
      • Timeline (start, 6 months, 9 months; schema change project): “I need new data”, “Finally, we start collecting”, “Let me see… is it any good?”
  • 37. “Schema on read” means faster time-to-innovation
      • Hadoop uses “schema on read”
      • Low barrier for data-driven innovation
      • Timeline (start, 3 months, 6 months): “I need new data”, “Let’s just put it in a folder on HDFS”, “Let me see… is it any good?”, “My model is awesome!”
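The "schema on read" idea from slide 37 can be illustrated with raw JSON lines that are stored untouched and only given a shape when an analysis reads them. The sketch is plain Python with made-up records; `RAW_LINES` and `read_with_schema` are hypothetical names, and a real deployment would read files from HDFS rather than an in-memory list.

```python
import json

# Raw records are stored as they arrived; nothing forces them to share fields.
RAW_LINES = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart"}',
    '{"user": "a", "page": "/home", "ms": 80, "referrer": "search"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: project each raw record onto `fields`.

    `fields` maps a field name to the default used when a record lacks it.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name, default) for name, default in fields.items()}


if __name__ == "__main__":
    # First analysis only cares about latency.
    print(list(read_with_schema(RAW_LINES, {"user": None, "ms": 0})))
    # A later analysis asks for a newer field without rewriting stored data.
    print(list(read_with_schema(RAW_LINES, {"user": None, "referrer": "unknown"})))
```

The second call shows the point of the slide: a new field becomes usable as soon as it appears in the raw data, with no schema change project in between.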
  • 38. Quick start: Hortonworks Sandbox
      • What is it
          – A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
          – A personal Hadoop environment
          – An integrated learning environment with frequently, easily updatable hands-on step-by-step tutorials
      • What it does
          – Dramatically accelerates the process of learning Apache Hadoop
          – Accelerates and validates the use of Hadoop within your unique data architecture
          – Use your data to explore and investigate your use cases
      • ZERO to big data in 15 minutes
      • Download Hortonworks up for Training for in-depth
  • 39. Hadoop Summit: Architecting the Future of Big Data
      • June 26-27, 2013, San Jose Convention Center
      • Co-hosted by Hortonworks & Yahoo!
      • Theme: Enabling the Next Generation Enterprise Data Platform
      • 90+ sessions and 7 tracks
      • Community-focused event
          – Sessions selected by a Conference Committee
          – Community Choice allowed public to vote for sessions they want to see
      • Pre-event training classes
          – Apache Hadoop Essentials: A Technical Understanding for Business Users
          – Understanding Microsoft HDInsight and Apache Hadoop
          – Developing Solutions with Apache Hadoop – HDFS and MapReduce
          – Applying Data Science using Apache Hadoop
      • 10% discount code:
  • 40. Thank you! Any questions? Ofer Mendelevitch, Director, Data Sciences @, @hortonworks. We’re hiring!