dotnet Cologne 2013 - Microsoft HDInsight for .NET Developers

Slides from my Microsoft HDInsight session at dotnet Cologne 2013

Transcript

  • 1. Sascha Dittmann
       Blog: http://www.sascha-dittmann.de
       Twitter: @SaschaDittmann
       Microsoft HDInsight for .NET Developers
       Big Data Analysis with JavaScript and C#
  • 2. Large Hadron Collider (CERN, Switzerland)
       http://public.web.cern.ch/public/en/lhc/Computing-en.html
       The LHC particle accelerator produces 15 PB of measurement data per year*
  • 3. Where Does Big Data Come From?
       70% of U.S. smartphone owners regularly shop online via their devices.
       44% of users (350M people) access Facebook via mobile devices.
       50% of millennials use mobile devices to research products.
       60% of U.S. mobile data will be audio and video streaming by 2014.
       2/3 of the world's mobile data traffic will be video by 2016.
       33% of BI will be consumed via handheld devices by 2013.
       Gaming consoles are now used an average of 1.5 hrs/wk to connect to the Internet.
       80% growth of unstructured data is predicted over the next five years.
       1.8 zettabytes of digital data were in use worldwide in 2011, up 30% from 2010.
       1 in 4 Facebook users add their location to posts (2B/month).
       500M Tweets are hosted on Twitter each day.
       38% of people recommend a brand they "like" or follow on a social network.
       Brands get 100M Facebook "likes" per day.
       Diagram labels: Big Data, Social, Mobility, Cloud
  • 4. Big Data Scenarios
       Web app optimization, smart meter monitoring, equipment monitoring,
       advertising analysis, life sciences research, fraud detection,
       healthcare outcomes, weather forecasting, natural resource exploration,
       social network analysis, churn analysis, traffic flow optimization,
       IT infrastructure optimization, legal discovery
  • 5. Big Data is sexy
       http://hbr.org/
  • 6. Apache Hadoop Ecosystem
       Hadoop = MapReduce + HDFS
       MapReduce (Job Scheduling/Execution System)
       HDFS (Hadoop Distributed File System)
       HBase (Column DB)
       Pig (Data Flow)
       Hive (Warehouse and Data Access)
       Oozie (Workflow)
       Sqoop
       Flume
       Traditional BI Tools
       HBase / Cassandra (Columnar NoSQL Databases)
       Avro (Serialization)
       Zookeeper (Coordination)
       Apache Mahout
       Cascading (programming model)
  • 7. Microsoft HDInsight
       Hadoop = MapReduce + HDFS
       MapReduce (Job Scheduling/Execution System)
       HDFS (Hadoop Distributed File System)
       HBase (Column DB)
       Pig (Data Flow)
       Hive (Warehouse and Data Access)
       Oozie (Workflow)
       Sqoop
       Flume
       Traditional BI Tools
       HBase / Cassandra (Columnar NoSQL Databases)
       Avro (Serialization)
       Zookeeper (Coordination)
       Apache Mahout
       Cascading (programming model)
       Microsoft integration: Windows, System Center, Active Directory, Visual Studio
  • 8. Hadoop Distributed File System (HDFS): boot process, fault tolerance, user request
  • 9. Hadoop Distributed File System (HDFS): boot process, fault tolerance, user request
  • 10. Hadoop Distributed File System (HDFS): boot process, fault tolerance, user request
  • 11. Hadoop Distributed File System (HDFS)
       Portable Operating System Interface (POSIX)
       Replication across multiple data nodes
       js> #ls /user/Sascha/input/ncdc
       Found 9 items
       drwxr-xr-x   - Sascha supergroup    0 2013-04-24 13:09 /user/Sascha/input/ncdc/all
       drwxr-xr-x   - Sascha supergroup    0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2
       drwxr-xr-x   - Sascha supergroup    0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata
       drwxr-xr-x   - Sascha supergroup    0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro
       drwxr-xr-x   - Sascha supergroup    0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
       -rw-r--r--   3 Sascha supergroup  529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
       -rw-r--r--   3 Sascha supergroup  168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
  • 12. HDInsight Dashboard Demo
  • 13. Map/Reduce Using Measurement Data as an Example
       0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
       0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
       0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
       0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
       0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
       Labeled fields: year, air temperature
  • 14. Map/Reduce Using Measurement Data as an Example
       Same five records as on slide 13.
       Labeled field: measurement quality
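The labeled fields can be read from each fixed-width record by character offset. The following is a minimal sketch, assuming the year sits at characters 15-18, the signed air temperature at 87-91, and the quality code at 92 (offsets checked against the sample records above, not stated on the slides). The helper name `parseRecord` is hypothetical.

```javascript
// Hypothetical helper: extracts year, air temperature, and quality code
// from one fixed-width weather record. The offsets are an assumption
// derived from the sample records on slides 13 and 14.
function parseRecord(line) {
  const year = line.substring(15, 19);
  // parseInt accepts the explicit sign ("+0022" -> 22, "-0011" -> -11).
  const temperature = parseInt(line.substring(87, 92), 10);
  const quality = line.substring(92, 93);
  return { year, temperature, quality };
}

// Third sample record from slide 13, split only for readability.
const sample =
  "0043011990999991950051518004+68750+023550FM-12+038299999" +
  "V0203201N00261220001CN9999999N9-00111+99999999999";

console.log(parseRecord(sample));
```

A map function would typically emit the pair (year, temperature) for each record and could use the quality code to skip unreliable readings.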
  • 15. Map/Reduce
       Diagram: three DataNodes, each running Map, Sort, and Shuffle, followed by one Reduce
       Input records (truncated):
         0067011990999991950051507004+68750
         0043011990999991950051512004+68750
         0043011990999991950051518004+68750
         0043012650999991949032412004+62300
         0043012650999991949032418004+62300
       Map output:         1949,0   1950,22   1950,55   1952,-11   1950,33
       After sort/shuffle: 1949,0   1950,[22,33,55]   1952,-11
       Reduce output:      1949,0   1950,55   1952,-11
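The data flow on slide 15 can be reproduced in memory. This sketch assumes the reduce step keeps the maximum temperature per year, which is consistent with the slide's result of 1950,55 for the values [22,33,55]. The function names are hypothetical and this is not Hadoop API code.

```javascript
// In-memory sketch of the shuffle and reduce phases (not Hadoop API code).
// Map output as shown on slide 15: [year, temperature] pairs.
const mapOutput = [
  ["1949", 0], ["1950", 22], ["1950", 55], ["1952", -11], ["1950", 33],
];

// Shuffle: group all values by key, then sort by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort(([a], [b]) => a.localeCompare(b));
}

// Reduce: assumed here to keep the maximum temperature per year.
function reduceMax([key, values]) {
  return [key, Math.max(...values)];
}

const result = shuffle(mapOutput).map(reduceMax);
console.log(result);
```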
  • 16. Map/Reduce with a Combine Method
       Diagram: three DataNodes, each running Map, Combine, Sort, and Shuffle, followed by one Reduce
       Input records (truncated):
         0067011990999991950051507004+68750
         0043011990999991950051512004+68750
         0043011990999991950051518004+68750
         0043012650999991949032412004+62300
         0043012650999991949032418004+62300
       Map output:         1949,0   1950,22   1950,55   1952,-11   1950,33
       Combine output:     1949,0   1950,55   1952,-11   1950,33
       After sort/shuffle: 1949,0   1950,[33,55]   1952,-11
       Reduce output:      1949,0   1950,55   1952,-11
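A combiner acts as a local reduce on each node before the shuffle, so fewer pairs cross the network. The sketch below assumes a maximum-per-year reduction and an arbitrary assignment of the slide's map output to three nodes; both the assignment and the helper name `maxPerKey` are assumptions.

```javascript
// Sketch of a combiner: a local "mini reduce" run on each node's map output
// before the shuffle. The assignment of pairs to nodes is an assumption.
const perNodeMapOutput = [
  [["1949", 0], ["1950", 22], ["1950", 55]], // node 1
  [["1952", -11]],                           // node 2
  [["1950", 33]],                            // node 3
];

// Keeps only the maximum value per key within one collection of pairs.
function maxPerKey(pairs) {
  const best = new Map();
  for (const [key, value] of pairs) {
    if (!best.has(key) || value > best.get(key)) best.set(key, value);
  }
  return [...best.entries()];
}

// Combine locally, then pass the smaller output through shuffle and reduce.
const combined = perNodeMapOutput.map(maxPerKey);
const shuffled = combined.flat();
const reduced = maxPerKey(shuffled).sort(([a], [b]) => a.localeCompare(b));

console.log(shuffled.length); // number of pairs left after combining
console.log(reduced);
```

Reusing the same function for combine and reduce works here because taking a maximum is associative and commutative; a reduction such as an average would need a different combiner.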
  • 17. Map/Reduce Using Measurement Data as an Example
  • 18. Counting Words with JavaScript (Map)
  • 19. Counting Words with JavaScript (Reduce)
  • 20. Map/Reduce with JavaScript
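The code from slides 18 to 20 is not part of this transcript. The following is a minimal word count sketch, assuming a `map(key, value, context)` and `reduce(key, values, context)` shape with `context.write` and an iterator that offers `hasNext()` and `next()`. The harness `runLocally` is hypothetical and only stands in for the cluster so the two functions can be tried out.

```javascript
// Word count sketch. The function shapes and the hasNext()/next() iterator
// are assumptions; the original slide code is not in the transcript.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local harness: runs map per line, groups by key, runs reduce.
function runLocally(lines) {
  var groups = Object.create(null);
  lines.forEach(function (line, offset) {
    map(offset, line, {
      write: function (k, v) { (groups[k] = groups[k] || []).push(v); },
    });
  });
  var output = {};
  Object.keys(groups).sort().forEach(function (k) {
    var values = groups[k];
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < values.length; },
      next: function () { return values[pos++]; },
    };
    reduce(k, iterator, { write: function (key, v) { output[key] = v; } });
  });
  return output;
}

console.log(runLocally(["Big Data is big", "data is everywhere"]));
```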
  • 21. Refining with Pig Latin
       pig.from("/user/Sascha/input/texte")
          .mapReduce("/user/…/WordCount.js", "Woerter, Anzahl:long")
          .orderBy("Anzahl DESC")
          .take(15)
          .to("/user/Sascha/output/Top15Woerter")
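What the `orderBy("Anzahl DESC")` and `take(15)` steps compute can be sketched on in-memory data. This is plain JavaScript, not the console's `pig` object; the helper `topN` is hypothetical and the sample rows are made up for illustration.

```javascript
// Plain JavaScript sketch of what orderBy("Anzahl DESC").take(n) computes.
// topN is a hypothetical helper, not part of any Pig or HDInsight API.
function topN(rows, n) {
  return rows
    .slice() // copy so the input array is not reordered
    .sort((a, b) => b.Anzahl - a.Anzahl)
    .slice(0, n);
}

// Made-up rows using the schema from the slide ("Woerter, Anzahl:long").
const counts = [
  { Woerter: "data", Anzahl: 7 },
  { Woerter: "hadoop", Anzahl: 12 },
  { Woerter: "hive", Anzahl: 3 },
];

console.log(topN(counts, 2));
```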
  • 22. Pig Latin
  • 23. Counting Words with C# (Map - Classic)
  • 24. Counting Words with C# (Reduce - Classic)
  • 25. Map/Reduce with C#
  • 26. .NET Job Submission Framework (Map)
  • 27. .NET Job Submission Framework (Reduce)
  • 28. Creating an External Hive Table
       CREATE EXTERNAL TABLE twitter_raw (tweet_json STRING)
       COMMENT 'Twitter Sample Data'
       ROW FORMAT DELIMITED LINES TERMINATED BY '10'
       STORED AS TEXTFILE
       LOCATION '/example/twitterdata';
  • 29. Twitter JSON
       {
         "possibly_sensitive_editable": true,
         "place": null,
         "text": "Pre - #ConvCloud chat insights. \" #Cloud Security, are we missing the point?\" from @christianve http://t.co/Smo0CPvb #HP #cloudsource",
         "id_str": "223418953114984448",
         "favorited": false,
         "possibly_sensitive": false,
         "created_at": "Thu Jul 12 14:10:04 +0000 2012",
         "retweeted": false,
         "retweet_count": 0,
         "user": {
           "is_translator": false,
           "profile_use_background_image": true,
           "profile_image_url_https": "https://si0.twimg.com/profile_images/640456324/Paul_Calento_normal.jpg",
           "id_str": "103006513",
           "profile_text_color": "333333",
           "statuses_count": 5984,
           "following": null,
           "followers_count": 744,
           "default_profile_image": false,
           "profile_link_color": "FF3300",
         },
         …..
       }
  • 30. Interpreting JSON in Hive
       FROM twitter_raw
       INSERT OVERWRITE TABLE twitter_temp
       SELECT
         get_json_object(tweet_json, '$.created_at'),
         substr(get_json_object(tweet_json, '$.created_at'), 9, 2),
         substr(get_json_object(tweet_json, '$.created_at'), 12, 8),
         get_json_object(tweet_json, '$.in_reply_to_user_id_str'),
         get_json_object(tweet_json, '$.text'),
         get_json_object(tweet_json, '$.contributors'),
         get_json_object(tweet_json, '$.retweeted'),
         get_json_object(tweet_json, '$.truncated'),
         get_json_object(tweet_json, '$.favorited'),
         cast(get_json_object(tweet_json, '$.retweet_count') as int),
         /* … */
         get_json_object(tweet_json, '$.user.profile_image_url_https'),
         cast(get_json_object(tweet_json, '$.user.followers_count') as int),
         get_json_object(tweet_json, '$.user.location'),
         get_json_object(tweet_json, '$.user.time_zone'),
         get_json_object(tweet_json, '$.user.created_at');
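The extraction on slide 30 can be mirrored in plain JavaScript to see what the two `substr` calls return for the `created_at` value of the sample tweet on slide 29. Hive's `substr` counts from 1, so the hypothetical helper `hiveSubstr` shifts the start index; `extractFields` is likewise a made-up name, and only a few of the selected columns are shown.

```javascript
// Sketch of what the query extracts per tweet, in plain JavaScript.
// hiveSubstr mimics Hive's 1-based substr(str, start, length);
// both helper names are hypothetical.
function hiveSubstr(str, start, length) {
  return str.slice(start - 1, start - 1 + length);
}

function extractFields(tweetJson) {
  const tweet = JSON.parse(tweetJson);
  return {
    createdAt: tweet.created_at,
    day: hiveSubstr(tweet.created_at, 9, 2),
    time: hiveSubstr(tweet.created_at, 12, 8),
    retweetCount: parseInt(tweet.retweet_count, 10),
    followersCount: parseInt(tweet.user.followers_count, 10),
  };
}

// Shortened record using values from the sample tweet on slide 29.
const raw = JSON.stringify({
  created_at: "Thu Jul 12 14:10:04 +0000 2012",
  retweet_count: 0,
  user: { followers_count: 744 },
});

console.log(extractFields(raw));
```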
  • 31. Hive
  • 32. RDBMS vs. Hadoop
                         RDBMS                   Hadoop
       Volume            Gigabytes               Petabytes
       Processing        Ad hoc and batch        Batch
       Updates           Many reads and writes   Write once, read many
       Schema            Static schema           Dynamic schema
       Data integrity    High                    Low
       Scaling behavior  Nonlinear               Linear
  • 33. Polybase / SQL Server PDW
  • 34. Questions?