SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)


In this session we use a practical scenario to show how concrete tasks can be solved with HDInsight in practice:
- Fundamentals of HDInsight for Windows Server and Windows Azure
- Working with Windows Azure HDInsight
- Implementing MapReduce jobs with JavaScript and .NET code

Published in: Technology

  • In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more. For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: both in the choice of deployment platforms such as Windows, Azure and more, and in creating deeply engineered solutions with key partners such as Teradata. And consistent with our approach, all of this is done in 100% open source.

    1. SQLSaturday #230 Rheinland, 13.07.2013
       - Sascha Dittmann, Software Architect & Developer, Ernst & Young GmbH, www.sascha-dittmann.de
       - Georg Urban, Snr. Technology Solution Professional | Data Platform, georg.urban@microsoft.com
    3. Big Data Characteristics: "3 Vs"
    4. How to deal with the "3 Vs"?
    5. A brief history of Hadoop
       - 2002: The Apache Nutch open source search engine is started by Doug Cutting
       - 2003: Google publishes a paper on GFS (Google Distributed File System)
       - 2004: Nutch Distributed File System (NDFS) is developed
       - 2004: Google publishes a paper on MapReduce (http://labs.google.com/papers/mapreduce.htm)
       - 2005: MapReduce is implemented on NDFS
       - 2006: Doug Cutting joins Yahoo! and starts the Apache Hadoop subproject
       - 2008: Hadoop is made an Apache top-level project
         - Yahoo's search index runs on a 10,000-node cluster
         - Hadoop breaks the record for the 1 TB sort: 209 s on 910 nodes
         - The New York Times converts 4 TB of archives to PDFs in 24 h on 100 nodes
       - Today: Hadoop becomes a synonym for Big Data processing
    6. Hadoop: The popular Face of Big Data
    7. RDBMS & Hadoop Comparison

       |             | Traditional RDBMS       | MapReduce                   |
       |-------------|-------------------------|-----------------------------|
       | Data volume | Terabytes               | Petabytes / Exabytes        |
       | Access      | Interactive & batch     | Batch                       |
       | Updates     | Read / write many times | Write once, read many times |
       | Structure   | Static schema           | Dynamic schema              |
       | Integrity   | High (ACID)             | Low (BASE*)                 |
       | Scaling     | Non-linear              | Linear                      |
       | DBA ratio   | 1:40                    | 1:3000                      |

       *BASE: Basically Available, Soft state, Eventual consistency
       Source: Tom White's Hadoop: The Definitive Guide
    8. MapReduce is simple… (well: basically)
    9. The Hadoop Ecosystem (simplified). Source: Tom White's Hadoop: The Definitive Guide
    10. The Hadoop Ecosystem (parts of it…)
        - Hadoop = MapReduce + HDFS
        - HBase (column DB), Hive, Mahout, Oozie, Sqoop, HBase/Cassandra/Couch/MongoDB, Avro, Zookeeper, Pig, Karmasphere, Flume, Cascading, R, Ambari, HCatalog, Datameer, Hortonworks, Cloudera, Splunk, HStreaming, MapR, Hadapt
    11. There's even more: Mahout for machine learning
        - Scalable machine learning library that leverages the Hadoop infrastructure
        - Key use cases:
          - Recommendation mining
          - Clustering
          - Classification
        - Algorithms: K-means Clustering, Naïve Bayes, Decision Tree, Neural Network, Hierarchical Clustering, Positive Matrix Factorization and more…
    12. R for statistical computing
        - An open and extensible statistical computing environment
          - Based on the S language
        - Used by data scientists to explore data and generate graphical output
        - A well-developed programming language
        - Many "packages" available to extend R
    13. …but: That's not Enterprise ready… Really not…
    15. Big Data in the Enterprise should…
        - fit into the existing IT infrastructure
        - be easy to manage
        - rely on existing skill sets
        - be cost-optimized
    16. Why Apache Hadoop on Windows?
        - According to IDC, Windows Server held a 73% market share in 2012
        - Hadoop was traditionally built for Linux servers, so there are a large number of underserved organizations
        - According to the 2012 Barclays CIO study, big data outranks virtualization as the #1 trend driving spending initiatives
        - Unstructured data growth exceeds 80% year over year in most enterprises
        - Apache Hadoop is the de facto big data platform for processing massive amounts of unstructured data
        - Complementary to existing Microsoft technologies
        - There is a huge untapped community of Windows developers and ecosystem partners
        - A strong Microsoft-Hortonworks partnership and 18 months of development make this a natural next step
    17. Hortonworks Data Platform (HDP): Enterprise Hadoop Distribution
        - Hadoop designed for enterprises
        - The "really complete" open source distribution
        - Ecosystem designed for interoperability
        - Deployment targets: OS, Cloud, VM, Appliance
        - Layers: Hadoop Core, Platform Services, Data Services, Operational Services
        - Capabilities: distributed data storage & processing; enterprise availability; store, process & connect; management of the Hadoop environment
    18. Leadership that Starts at the Core
        - Driving next-generation Hadoop
          - YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
          - 420k+ lines authored since 2006
          - More than twice the nearest contributor
        - Deeply integrating w/ ecosystem
          - Enabling new deployment platforms (e.g. Windows & Azure, Linux & VMware HA)
          - Creating deeply engineered solutions (e.g. Teradata big data appliance)
        - All Apache, NO holdbacks
          - 100% of code contributed to Apache
    19. HDInsight: Windows-optimized Hadoop (Big Data @Microsoft)
        - Microsoft HDInsight Server on Windows Server
        - Windows Azure HDInsight Service (Cloud)
        - Enterprise-ready Hadoop
        - Simplicity & manageability of Windows
        - AD integration
        - Monitoring (System Center)
        - Integrated in Microsoft Business Intelligence
        - JavaScript, HiveODBC, .NET …
        - Up and running in minutes with HDInsight Service
    20. Microsoft Big Data Solution (two months ago…)
    22. Windows Azure: Elastic Big Data
    23. Windows Azure HDInsight Service Hadoop Cluster
    24. Hadoop on Azure
        - Components: Azure Blob Storage, Name Node, Data Nodes (HDFS), SQL Azure, application end point
        - On-premise enterprise content: transactional DBs, on-prem logs, internal sensors
        - Cloud enterprise content: generated in Azure
        - 3rd-party content: Azure Datamarket, generated/stored elsewhere, public content, delivered online
    25. Using Blob Storage From HDInsight
        - An HDInsight cluster is bound to one "default" blob storage account & container at cluster creation time
        - Using the "default" container requires no special addressing ("/" == root folder, etc.)
        - Additional blob storage accounts or containers are addressed as:
          asv[s]://<container>@<account>.blob.core.windows.net/<path>
        - Additional storage accounts need to be registered in site-config.xml:
          <property>
            <name>fs.azure.account.key.accountname</name>
            <value>enterthekeyvaluehere</value>
          </property>
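As a rough illustration of the addressing scheme on slide 25, the following Python sketch assembles an `asv[s]://` URI and the matching `<property>` element. The helper names `asv_uri` and `account_key_property` and the sample container and account values are hypothetical; only the URI pattern and the `fs.azure.account.key.` name prefix are taken from the slide.

```python
from xml.sax.saxutils import escape


def asv_uri(container: str, account: str, path: str, secure: bool = False) -> str:
    """Build asv[s]://<container>@<account>.blob.core.windows.net/<path>."""
    scheme = "asvs" if secure else "asv"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def account_key_property(account: str, key: str) -> str:
    """Render the <property> element that registers a storage account key.

    The name follows the pattern shown on the slide; a real cluster may
    expect a different suffix, so treat this as an assumption.
    """
    return (
        "<property>\n"
        f"  <name>fs.azure.account.key.{escape(account)}</name>\n"
        f"  <value>{escape(key)}</value>\n"
        "</property>"
    )


# Hypothetical sample values, for illustration only.
print(asv_uri("mycontainer", "myaccount", "/logs/2013/07.txt", secure=True))
print(account_key_property("myaccount", "enterthekeyvaluehere"))
```

The `secure` flag simply switches between the `asv` and `asvs` schemes that the slide abbreviates as `asv[s]`.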
    26. Transporting Data with AzCopy
        - Utility for moving data to/from Azure Blob Storage (like robocopy)
        - 50 MB/s transfer rate within the data center

        | Container   | Blob Name       |
        |-------------|-----------------|
        | mycontainer | a.txt           |
        | mycontainer | b.txt           |
        | mycontainer | dir1/c.txt      |
        | mycontainer | dir1/dir2/d.txt |
    27. Intro to HDInsight
    28. Map/Reduce
        - Pipeline: Map, Sort and Shuffle on each DataNode, followed by Reduce
        - Input records:
          0067011990999991950051507004+68750
          0043011990999991950051512004+68750
          0043011990999991950051518004+68750
          0043012650999991949032412004+62300
          0043012650999991949032418004+62300
        - Map output: 1949,0 1950,22 1950,55 1952,-11 1950,33
        - After sort/shuffle: 1949,0 1950,[22,33,55] 1952,-11
        - Reduce output: 1949,0 1950,55 1952,-11
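The data flow on slide 28 can be sketched in a few lines of Python. This is a single-process illustration that starts from the (year, temperature) pairs shown as map output, not an actual Hadoop job; the function names `shuffle` and `reduce_max` are chosen here for illustration.

```python
from collections import defaultdict

# (year, temperature) pairs as emitted by the map step on the slide.
map_output = [(1949, 0), (1950, 22), (1950, 55), (1952, -11), (1950, 33)]


def shuffle(pairs):
    """Group values by key and sort them, as the sort/shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_max(grouped):
    """Reduce each key's list of values to its maximum."""
    return {key: max(values) for key, values in grouped.items()}


grouped = shuffle(map_output)  # {1949: [0], 1950: [22, 33, 55], 1952: [-11]}
result = reduce_max(grouped)   # {1949: 0, 1950: 55, 1952: -11}
print(grouped)
print(result)
```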
    29. Map/Reduce with Combine
        - Pipeline: Map, Combine, Sort and Shuffle on each DataNode, followed by Reduce
        - Input records:
          0067011990999991950051507004+68750
          0043011990999991950051512004+68750
          0043011990999991950051518004+68750
          0043012650999991949032412004+62300
          0043012650999991949032418004+62300
        - Map output: 1949,0 1950,22 1950,55 1952,-11 1950,33
        - After combine: 1949,0 1950,55 1952,-11 1950,33
        - After sort/shuffle: 1949,0 1950,[33,55] 1952,-11
        - Reduce output: 1949,0 1950,55 1952,-11
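To make the effect of the combine step on slide 29 concrete, here is a small Python sketch. How the map output is split across nodes is an assumption (the slide does not say which node emitted which pair); the split below was chosen so that the combined output matches the slide. The helper `max_by_key` is a hypothetical name.

```python
from collections import defaultdict

# Map output per node. The split across two nodes is an assumption
# chosen so that the combined output matches the slide.
node_outputs = [
    [(1949, 0), (1950, 22), (1950, 55), (1952, -11)],  # hypothetical node A
    [(1950, 33)],                                      # hypothetical node B
]


def max_by_key(pairs):
    """Keep only the maximum value per key (the combiner logic)."""
    best = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return best


# Combine: runs locally on each node, before data crosses the network.
combined = [sorted(max_by_key(pairs).items()) for pairs in node_outputs]

# Shuffle: group the pre-aggregated pairs from all nodes by key.
shuffled = defaultdict(list)
for pairs in combined:
    for key, value in pairs:
        shuffled[key].append(value)

# Reduce: take the maximum over the per-node maxima.
result = {key: max(values) for key, values in sorted(shuffled.items())}

sent_without_combine = sum(len(pairs) for pairs in node_outputs)
sent_with_combine = sum(len(pairs) for pairs in combined)
print(result, sent_without_combine, sent_with_combine)
```

Reusing the maximum as a combiner is safe because max is associative and commutative. An aggregate such as the mean cannot be combined this way without carrying extra state (for example a sum and a count).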
    30. Map/Reduce (JavaScript)
    31. Refining with Pig Latin
        pig.from("/user/Sascha/input/twitter")
           .mapReduce("/user/…/FollowersCount.js", "User, Followers:long")
           .orderBy("Followers DESC")
           .take(10)
           .to("/user/Sascha/output/Top10Followers")
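What the `orderBy("Followers DESC")` and `take(10)` steps on slide 31 compute can be sketched in plain Python. The rows below are hypothetical stand-ins for the output of the `mapReduce` step, and `top_n` is an illustrative helper, not part of the HDInsight API.

```python
from operator import itemgetter

# Hypothetical (user, followers) rows standing in for the mapReduce output.
rows = [("alice", 120), ("bob", 4500), ("carol", 980), ("dave", 15)]


def top_n(records, n=10):
    """Sort by follower count, descending, and keep the first n rows."""
    return sorted(records, key=itemgetter(1), reverse=True)[:n]


top = top_n(rows, n=3)
print(top)
```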
    32. Pig Latin
    33. Map in C# (Classic)
    34. Reduce in C# (Classic)
    35. Map/Reduce with C#
    36. .NET Job Submission Framework (Map)
    37. .NET Job Submission Framework (Reduce)
    38. Many thanks to the volunteers!
    39. Big raffle!
        - At the end of the event (approx. 18:00)
        - Win lots of prizes!
        - So: visit our sponsors!
    40. Our "You Rock!" sponsors
    41. Many thanks to all our sponsors! Gold, Silver, Bronze
    42. Media sponsors:
    43. Hands-on event: PASS Camp 2013!