Big Data, Hadoop, Hortonworks and Microsoft HDInsight



Big Data is everywhere. And at the center of the big data discussion is Apache Hadoop, a next-generation enterprise data platform that allows you to capture, process and share the enormous amounts of new, multi-structured data that doesn’t fit into traditional systems.

With Microsoft HDInsight, powered by Hortonworks Data Platform, you can bridge this new world of unstructured content with the structured data we manage today. Together, we bring Hadoop to the masses as an addition to your current enterprise data architectures so that you can amass net new insight without net new headache.

Published in: Education


  • For the visual thinkers out there, let’s expand our mathematical model to show some concrete examples. ERP, SCM, CRM, and transactional Web applications are classic examples of systems processing Transactions. Highly structured data in these systems is typically stored in SQL databases. Interactions are about how people and things interact with each other or with your business. Web logs, user click streams, social interactions & feeds, and user-generated content are classic places to find Interaction data. Observational data tends to come from the “Internet of Things”. Sensors for heat, motion, and pressure, along with RFID and GPS chips within such things as mobile devices, ATMs, and even aircraft engines, are just some examples of “things” that output Observation data. Most folks would agree that video is “big” data. The analysis of what’s happening in that video (i.e., what you, I, and others are doing in it) may not be “big”, but it is valuable and it does fit under our umbrella. Moreover, business data feeds and publicly available data sets are also “big data”, so we should not limit our thinking to just the data that flows through an organization. For example, the mortgage-related data you may have COULD benefit from being blended with external data found in Zillow. The government has the Open Data Initiative, which means that more and more data is being made publicly available. One of the use cases I find interesting is Predictive Policing, where state and local law enforcement apply analytics to crime databases and other publicly available data to help predict where and when pockets of crime might spring up. These proactive analytics efforts have yielded real reductions in crime! Anyhow, this is what Big Data means to me… hopefully it makes sense to you. It is important to note that we think of big data beyond the traditional concepts of volume, velocity and variety, extending into transactions, interactions and observations. In reality, this IS the big data our customers are dealing with.
  • PolyBase (Gray Systems Lab, Dr. David DeWitt) and the future of query processing: one interface to query relational and Hadoop data; query data without moving it; expansion to other data sources in the future; seamless integration with unstructured data and Hadoop. It is going to dramatically simplify how users query relational and Hadoop data. Pioneered in the Jim Gray Systems Lab by David DeWitt, PolyBase is a federated query processor in SQL Server 2012 Parallel Data Warehouse (PDW). It is a breakthrough over traditional query processing because it joins structured data with unstructured data from Hadoop. Without manual intervention, the PolyBase query processor can accept a standard SQL query and combine tables from a relational source with tables from a Hadoop source directly through external tables. It also parallelizes the import and export of data to and from Hadoop, giving PDW speed, simplicity, and responsiveness in addressing these new types of queries. Key points: 1) standard T-SQL can join relational data with unstructured data in Hadoop; 2) PolyBase rapidly imports and exports data between Hadoop and PDW in parallel; 3) PolyBase can query data in Hadoop directly without movement (with external tables); 4) PolyBase was created in the Gray Systems Lab by David DeWitt.
  • And that's the second thing I wanted to share with you this afternoon
  • We believe that Hadoop can be in a position to process more than half the world’s data. I’ve talked to a variety of industry analysts, and there’s not a big argument over Hadoop’s opportunity to achieve this. Some would argue it should be 2016 or 2017, rather than 2015. But we believe aggressive goals help focus people on the right things, so let’s keep it 2015 for now, and let’s see how close we can get. The point here is that this statement can act as our “north star” and help guide our way as we focus on our list of 5 items we can be doing: 1) be diligent stewards of the open source core; 2) be tireless innovators beyond the core; 3) provide robust data platform services & open APIs; 4) enable the ecosystem at each layer of the stack; 5) make the platform enterprise-ready & easy to use.
  • Big Data, Hadoop, Hortonworks and Microsoft HDInsight

    1. Polling Question: How important is Big Data to your business? ___ Very Important ___ Somewhat Important ___ Not Important. © Hortonworks Inc. 2012
    2. Big Data, Hadoop, Hortonworks and Microsoft HDInsight
    3. Your Presenters: Jim Walker • Director, Product Marketing • Computer Security and MDM; Saptak Sen • Senior Product Manager • Big Data & NoSQL Technology
    4. Why Data Driven Business? “Data driven decisions are better decisions – it’s as simple as that. Using big data enables managers to decide on the basis of evidence rather than intuition. For that reason it has the potential to revolutionize management.” Harvard Business Review, October 2012
    5. Big Data: Organizational Game Changer. Transactions + Interactions + Observations = BIG DATA. [Figure: data volume (megabytes to petabytes) plotted against increasing data variety and complexity. ERP (megabytes): purchase detail, purchase record, payment record. CRM (gigabytes): segmentation, offer details, offer history, customer touches, support contacts. WEB (terabytes): web logs, A/B testing, behavioral targeting, dynamic pricing, search marketing, affiliate networks, dynamic funnels. BIG DATA (petabytes): mobile web, sentiment, SMS/MMS, user click stream, speech to text, social interactions & feeds, spatial & GPS coordinates, sensors / RFID / devices, business data feeds, external demographics, user generated content, HD video, audio, images, product/service logs.]
    9. Polling Question: What tools are you using with Big Data? ___ Hadoop ___ NoSQL ___ Other ___ All of the above
    10. Big Data: Optimize Outcomes at Scale. Sports: optimize championships. Intelligence: optimize detection. Finance: optimize algorithms. Advertising: optimize performance. Fraud: optimize prevention. Retail / Wholesale: optimize inventory turns. Manufacturing: optimize supply chains. Healthcare: optimize patient outcomes. Education: optimize learning outcomes. Government: optimize citizen services. Source: Geoffrey Moore, Hadoop Summit 2012 keynote presentation.
    11. A little history… it’s 2005
    12. …and then there was MapReduce
    13. Apache Hadoop: Center of Big Data Strategy. Open source data management with scale-out storage & distributed processing. Storage (HDFS): • Distributed across “nodes” • Natively redundant • Name node tracks locations. Processing (MapReduce): • Splits a task across processors “near” the data & assembles results • Self-healing, high-bandwidth clustered storage. Key characteristics: • Scalable – efficiently store and process petabytes of data; linear scale driven by additional processing and storage • Reliable – redundant storage; failover across nodes and racks • Flexible – store all types of data in any format; apply schema on analysis and sharing of the data • Economical – use commodity hardware; open source software guards against vendor lock-in
    14. What is a Hadoop “Distribution”? A complementary set of open source technologies that make up a complete data platform: Templeton, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper. • Tested and pre-packaged to ease installation and usage • Collects the right versions of the components, which all have different release cycles, and ensures they work together
    15. Apache Hadoop & Big Data Use Cases. Big Data (Transactions, Interactions, Observations): Refine, Explore, Enrich. Business Case
    16. 3 Patterns of Hadoop Use: Refine, Explore, Enrich
    17. 3 Patterns of Hadoop Use: Refine, Explore, Enrich. Einstein photo courtesy of Wikipedia, Creative Commons
    18. 3 Patterns of Hadoop Use: Refine, Explore, Enrich. Einstein photo courtesy of Wikipedia, Creative Commons
    19. Balancing Innovation & Stability • Hadoop is “pre-chasm” • Ecosystem still evolving • Enterprises endure 1-3 year adoption cycle. [Figure: technology adoption curve, relative % of customers over time, with The CHASM marked: innovators (technology enthusiasts); early adopters (visionaries); early majority (pragmatists); late majority (conservatives); laggards (skeptics). On one side customers want technology & performance; on the other, customers want solutions & convenience.] Source: Geoffrey Moore, Crossing the Chasm
    23. Demonstration: Mining Market Data – Showcase back testing on Interactive Data – Leveraging Excel Tool & BI Tool
    24. Looking Ahead | Microsoft PolyBase. “I’ve said it before: Massively Parallel Processing (MPP) data warehouse appliances are Big Data databases.” - Andrew Brust. SQL Server PDW with PolyBase: • Single query for relational & Hadoop data • Process data in place • Seamless: regular T-SQL command • Future expansion to other data sources
    25. Hadoop Better on Windows • Active Directory • System Center. Microsoft Data Connectivity • SQL Server / SQL Parallel Data Warehouse • Azure Storage / Azure Data Market. Microsoft Business Intelligence (BI) • ODBC Connectivity
    26. Leading Innovation at the Core. We focus on innovating the core of Apache Hadoop. • Hortonworks employs the original architects, builders and operators of Apache Hadoop • All Apache, NO holdbacks: 100% of all code contributed back to open source Apache projects. [Chart: Number of Apache Hadoop Committers by Company. Labels extracted: MapR, Yahoo!, facebook, Cloudera; values extracted: 1, 17, 9, 4, 8.]
    27. What we do… We believe that by the end of 2015, more than half the world’s data will be processed by Apache Hadoop. Strategy: invest in Apache Hadoop to make it “the enterprise big data platform”. Distribution: • Hortonworks Data Platform (HDP) • Enterprise ready, stable, reliable, tested • 100% open source • Built by the architects, builders and operators of Apache Hadoop. Ecosystem: • Enable an ecosystem of Big Data apps • Our goal is to make sure all your tools work WITH Hadoop • HDP is Hadoop for Microsoft and Teradata. Support: • Deliver highest quality support and expertise • Access to Apache Hadoop experts • Hadoop training and certification by the Hadoop experts (web, public, private)
    29. Hadoop in Enterprise Data Architectures. [Figure: Existing business infrastructure, web, and new tech layers (IDE & dev tools, ODS & datamarts, applications & spreadsheets, visualization & intelligence, web applications, operations, discovery tools, EDW, low latency/NoSQL, custom, existing; Datameer, Tableau, Karmasphere, Splunk) sit above the Hadoop stack (Templeton, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper), which is fed by Big Data sources (transactions, observations, interactions): social media, exhaust data, logs, files, CRM, ERP, financials.]
    30. Big Data: It’s About Scale & Structure. A spectrum from RDBMS, EDW and MPP through NoSQL to Hadoop, comparing the RDBMS end with the Hadoop end. Data types: structured vs. multi and unstructured. Processing: limited, no data processing vs. processing coupled with data. Governance: standards and structured vs. loosely structured. Schema: required on write vs. required on read. Speed: reads are fast vs. writes are fast. Cost: software license vs. support only. Resources: known entity vs. growing, complexities, wide. Best fit use: interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing.
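
Slide 13 describes MapReduce as splitting a task across processors and assembling the results, and slide 30 notes that Hadoop requires a schema on read rather than on write. The following is a minimal single-process sketch of both ideas in plain Python, not Hadoop's actual API: the log lines, the three-field layout, and every function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw click-stream lines, stored exactly as they arrived.
# Nothing validates them at write time (schema is NOT required on write).
RAW_LINES: List[str] = [
    "2012-10-01 alice /home",
    "2012-10-01 bob /pricing",
    "malformed line",
    "2012-10-02 alice /pricing",
]


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Apply the schema at read time and emit (page, 1) pairs."""
    fields = line.split()
    if len(fields) != 3:
        return  # skip records that do not fit the schema chosen for this analysis
    _date, _user, page = fields
    yield page, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Assemble the per-key result from the grouped values."""
    return key, sum(values)


def run_job(splits: Iterable[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle and reduce.

    On a real cluster each split would be mapped on a node near its data;
    here the splits are simply processed one after another.
    """
    mapped = (pair for split in splits for line in split for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two splits stand in for blocks stored on two different nodes.
SPLITS = [RAW_LINES[:2], RAW_LINES[2:]]
PAGE_COUNTS = run_job(SPLITS)
print(PAGE_COUNTS)
```

Because the parsing lives in `map_phase`, a different analysis could read the same stored lines with a different schema without rewriting the data, which is the trade-off slide 30 summarizes as “writes are fast” versus “reads are fast”.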