Sascha Dittmann
Blog: http://www.sascha-dittmann.de
Twitter: @SaschaDittmann
Microsoft HDInsight for .NET Developers
Big Data Analysis with JavaScript and C#
Large Hadron Collider (CERN, Switzerland)
http://public.web.cern.ch/public/en/lhc/Computing-en.html
The LHC particle accelerator produces 15 PB of measurement data per year*
Where Does Big Data Come From?
70% of U.S. smartphone owners regularly shop online via their devices.
44% of users (350M people) access Facebook via mobile devices.
50% of millennials use mobile devices to research products.
60% of U.S. mobile data will be audio and video streaming by 2014.
Mobility
2/3 of the world's mobile data traffic will be video by 2016.
33% of BI will be consumed via handheld devices by 2013.
Gaming consoles are now used an average of 1.5 hrs/wk to connect to the Internet.
80% growth of unstructured data is predicted over the next five years.
1.8 zettabytes of digital data were in use worldwide in 2011, up 30% from 2010.
1 in 4 Facebook users add their location to posts (2B/month).
500M Tweets are hosted on Twitter each day.
38% of people recommend a brand they “like” or follow on a social network.
Brands get 100M Facebook “likes” per day.
Big Data, Social, Mobility, Cloud
Big Data Scenarios
Web app optimization
Smart meter monitoring
Equipment monitoring
Advertising analysis
Life sciences research
Fraud detection
Healthcare outcomes
Weather forecasting
Natural resource exploration
Social network analysis
Churn analysis
Traffic flow optimization
IT infrastructure optimization
Legal discovery
Big Data Is Sexy
http://hbr.org/
Apache Hadoop Ecosystem
MapReduce (Job Scheduling/Execution System)
HDFS (Hadoop Distributed File System)
HBase (Column DB)
Pig (Data Flow)
Hive (Warehouse and Data Access)
Oozie (Workflow)
Sqoop
Traditional BI Tools
HBase / Cassandra (Columnar NoSQL Databases)
Avro (Serialization)
Zookeeper (Coordination)
Apache Mahout
Cascading (programming model)
Hadoop = MapReduce + HDFS
Flume
Microsoft HDInsight
MapReduce (Job Scheduling/Execution System)
HDFS (Hadoop Distributed File System)
HBase (Column DB)
Pig (Data Flow)
Hive (Warehouse and Data Access)
Oozie (Workflow)
Sqoop
Traditional BI Tools
HBase / Cassandra (Columnar NoSQL Databases)
Avro (Serialization)
Zookeeper (Coordination)
Apache Mahout
Cascading (programming model)
Hadoop = MapReduce + HDFS
Flume
Windows
System Center
Active Directory
Visual Studio
Hadoop Distributed File System (HDFS)
Boot process
Fault tolerance
User request
Hadoop Distributed File System (HDFS)
• Portable Operating System Interface (POSIX)
• Replication across multiple data nodes
js> #ls /user/Sascha/input/ncdc
Found 9 items
drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:09 /user/Sascha/input/ncdc/all
drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
-rw-r--r-- 3 Sascha supergroup 529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
-rw-r--r-- 3 Sascha supergroup 168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
HDInsight Dashboard Demo
Map/Reduce Using Measurement Data as an Example
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Highlighted fields: year, air temperature, measurement quality
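The three fields can be read from each record by position. The following is a minimal sketch, assuming the fixed-width NCDC layout in which the year sits at characters 15-18, the signed air temperature (in tenths of a degree Celsius) at 87-91, and its quality code at 92, all 0-based. The function names and the validity rule are illustrative assumptions, not taken from the slides.

```javascript
// Assumed field offsets (0-based, end-exclusive) in the fixed-width record.
function parseRecord(line) {
  const year = line.substring(15, 19);
  // Signed value such as "+0022" or "-0011"; parseInt accepts the leading sign.
  const airTemperature = parseInt(line.substring(87, 92), 10);
  const quality = line.substring(92, 93);
  return { year, airTemperature, quality };
}

// Assumed validity rule: 9999 marks a missing reading, and only
// the quality codes 0, 1, 4, 5 and 9 are treated as trustworthy.
function isValid(record) {
  return record.airTemperature !== 9999 && /[01459]/.test(record.quality);
}

const sample =
  "0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999";
const record = parseRecord(sample);
console.log(record.year, record.airTemperature, record.quality, isValid(record));
```

A mapper would emit (year, airTemperature) only for records that pass isValid.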
Map/Reduce
Three DataNodes, each running Map, Sort, Shuffle; followed by Reduce
Input records (truncated):
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
Map output: (1949,0) (1950,22) (1950,55) (1952,-11) (1950,33)
After sort and shuffle: (1949,0) (1950,[22,33,55]) (1952,-11)
Reduce output: (1949,0) (1950,55) (1952,-11)
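The data flow on this slide can be replayed in a few lines of plain JavaScript. This is only a local, single-process sketch of the idea; the function names shuffle and reduceMax are illustrative, and the pairs are the ones from the slide.

```javascript
// Map output as shown on the slide: [year, temperature] pairs.
const mapOutput = [["1949", 0], ["1950", 22], ["1950", 55], ["1952", -11], ["1950", 33]];

// Shuffle/sort: collect all values per key, then order the keys.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort(([a], [b]) => a.localeCompare(b));
}

// Reduce: one call per key, here returning the maximum temperature.
function reduceMax(key, values) {
  return [key, Math.max(...values)];
}

const grouped = shuffle(mapOutput);
const result = grouped.map(([year, temps]) => reduceMax(year, temps));
console.log(JSON.stringify(result)); // [["1949",0],["1950",55],["1952",-11]]
```

The values within one group arrive in no particular order, which is fine for a maximum.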
Map/Reduce with a Combine Step
Three DataNodes, each running Map, Combine, Sort, Shuffle; followed by Reduce
Input records (truncated):
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
Map output: (1949,0) (1950,22) (1950,55) (1952,-11) (1950,33)
Combine output: (1949,0) (1950,55) (1952,-11) (1950,33)
After sort and shuffle: (1949,0) (1950,[33,55]) (1952,-11)
Reduce output: (1949,0) (1950,55) (1952,-11)
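The effect of the combine step can be sketched locally: each node reduces its own map output before anything is sent over the network. The distribution of the five pairs across three nodes below is an assumption chosen to reproduce the numbers on the slide, and maxPerKey is an illustrative name.

```javascript
// Assumed distribution of the slide's map output across three data nodes.
const nodeOutputs = [
  [["1949", 0], ["1950", 22], ["1950", 55]],
  [["1952", -11]],
  [["1950", 33]],
];

// Keeps the maximum value per key. Because max is associative and
// commutative, the same function can serve as combiner and reducer.
function maxPerKey(pairs) {
  const best = new Map();
  for (const [key, value] of pairs) {
    if (!best.has(key) || value > best.get(key)) best.set(key, value);
  }
  return [...best.entries()];
}

const combined = nodeOutputs.map(maxPerKey); // runs locally on each node
const shuffled = combined.flat();            // pairs that cross the network
const reduced = maxPerKey(shuffled).sort(([a], [b]) => a.localeCompare(b));

console.log(shuffled.length);         // 4 pairs instead of 5
console.log(JSON.stringify(reduced)); // [["1949",0],["1950",55],["1952",-11]]
```

A combiner only works this directly for operations like max, min, or sum; an average, for example, would need to carry partial sums and counts instead.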
Map/Reduce Using Measurement Data as an Example
Counting Words with JavaScript (Map)
Counting Words with JavaScript (Reduce)
Map/Reduce with JavaScript
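A word-count script of this kind might look like the sketch below. It assumes the map(key, value, context) and reduce(key, values, context) shape, with context.write and an iterator offering hasNext()/next(), as used by the HDInsight JavaScript samples; treat those signatures as assumptions. The runLocally harness is purely hypothetical and exists only so the two functions can be tried outside a cluster.

```javascript
// Mapper: emit (word, 1) for every word in the input line.
// Note: the pattern splits on anything outside a-z, so umlauts break words.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

// Reducer: sum all counts that were emitted for one word.
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local harness that mimics map, shuffle and reduce in memory.
function runLocally(lines) {
  var emitted = [];
  lines.forEach(function (line, i) {
    map(i, line, { write: function (k, v) { emitted.push([k, v]); } });
  });
  var groups = Object.create(null);
  emitted.forEach(function (pair) {
    (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
  });
  var result = Object.create(null);
  Object.keys(groups).forEach(function (word) {
    var values = groups[word];
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < values.length; },
      next: function () { return values[pos++]; }
    };
    reduce(word, iterator, { write: function (k, v) { result[k] = v; } });
  });
  return result;
}

var counts = runLocally(["Big Data is big", "data data"]);
console.log(counts.big, counts.data, counts.is); // 2 3 1
```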
Refining with Pig Latin
pig.from("/user/Sascha/input/texte")
   .mapReduce("/user/…/WordCount.js", "Woerter, Anzahl:long")
   .orderBy("Anzahl DESC")
   .take(15)
   .to("/user/Sascha/output/Top15Woerter")
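What the orderBy and take steps contribute can be illustrated on a small in-memory result set. This is a plain JavaScript sketch of the idea, not the Pig API itself; the rows and the topN helper are made up for illustration, with column names borrowed from the schema string above.

```javascript
// Made-up (word, count) rows such as a word-count job might produce.
const rows = [
  { Woerter: "is", Anzahl: 1 },
  { Woerter: "data", Anzahl: 3 },
  { Woerter: "big", Anzahl: 2 },
];

// Equivalent of orderBy("Anzahl DESC") followed by take(n):
// sort a copy by count, highest first, and keep the first n rows.
function topN(input, n) {
  return [...input].sort((a, b) => b.Anzahl - a.Anzahl).slice(0, n);
}

const top2 = topN(rows, 2);
console.log(top2.map((row) => row.Woerter).join(",")); // data,big
```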
Pig Latin
Counting Words with C# (Map - Classic)
Counting Words with C# (Reduce - Classic)
Map/Reduce with C#
.NET Job Submission Framework (Map)
.NET Job Submission Framework (Reduce)
Creating an External Hive Table
CREATE EXTERNAL TABLE twitter_raw
(
    tweet_json STRING
)
COMMENT 'Twitter Sample Data'
ROW FORMAT DELIMITED LINES TERMINATED BY '10'
STORED AS TEXTFILE
LOCATION '/example/twitterdata';
Twitter JSON
{
  "possibly_sensitive_editable":true,
  "place":null,
  "text":"Pre - #ConvCloud chat insights. \"#Cloud Security, are we missing the point?\" from @christianve http://t.co/Smo0CPvb #HP #cloudsource",
  "id_str":"223418953114984448",
  "favorited":false,
  "possibly_sensitive":false,
  "created_at":"Thu Jul 12 14:10:04 +0000 2012",
  "retweeted":false,
  "retweet_count":0,
  "user":{
    "is_translator":false,
    "profile_use_background_image":true,
    "profile_image_url_https":"https://si0.twimg.com/profile_images/640456324/Paul_Calento_normal.jpg",
    "id_str":"103006513",
    "profile_text_color":"333333",
    "statuses_count":5984,
    "following":null,
    "followers_count":744,
    "default_profile_image":false,
    "profile_link_color":"FF3300",
  }, …
}
Interpreting JSON in Hive
FROM twitter_raw
INSERT OVERWRITE TABLE twitter_temp
SELECT get_json_object(tweet_json, '$.created_at'),
       substr(get_json_object(tweet_json, '$.created_at'),9,2),
       substr(get_json_object(tweet_json, '$.created_at'),12,8),
       get_json_object(tweet_json, '$.in_reply_to_user_id_str'),
       get_json_object(tweet_json, '$.text'),
       get_json_object(tweet_json, '$.contributors'),
       get_json_object(tweet_json, '$.retweeted'),
       get_json_object(tweet_json, '$.truncated'),
       get_json_object(tweet_json, '$.favorited'),
       cast(get_json_object(tweet_json, '$.retweet_count') as int),
       /* … */
       get_json_object(tweet_json, '$.user.profile_image_url_https'),
       cast(get_json_object(tweet_json, '$.user.followers_count') as int),
       get_json_object(tweet_json, '$.user.location'),
       get_json_object(tweet_json, '$.user.time_zone'),
       get_json_object(tweet_json, '$.user.created_at');
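To see what the two substr calls pull out of created_at, the same extraction can be replayed in JavaScript. This sketch assumes Hive's substr counts positions from 1; hiveSubstr is an illustrative helper, and the JSON string is a shortened version of the sample tweet.

```javascript
// Mirrors Hive's substr(str, pos, len), which uses 1-based positions.
function hiveSubstr(str, pos, len) {
  return str.substring(pos - 1, pos - 1 + len);
}

// Shortened sample tweet with only the fields used below.
const tweetJson =
  '{"created_at":"Thu Jul 12 14:10:04 +0000 2012","retweet_count":0,"user":{"followers_count":744}}';
const tweet = JSON.parse(tweetJson);

const createdAt = tweet.created_at;              // like '$.created_at'
const day = hiveSubstr(createdAt, 9, 2);         // "12"
const time = hiveSubstr(createdAt, 12, 8);       // "14:10:04"
const followers = tweet.user.followers_count;    // like '$.user.followers_count'

console.log(day, time, followers); // 12 14:10:04 744
```

In Hive, get_json_object returns strings, which is why the query casts the numeric fields to int; JSON.parse already yields numbers here.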
Hive
RDBMS vs. Hadoop
                  RDBMS                    Hadoop
Volume            Gigabytes                Petabytes
Processing        Ad hoc and batch         Batch
Updates           Many reads and writes    Write once, read many times
Schema            Static schema            Dynamic schema
Data integrity    High                     Low
Scaling behavior  Nonlinear                Linear
Polybase / SQL Server PDW
Questions?

dotnet Cologne 2013 - Microsoft HDInsight for .NET Developers

  • 1. Sascha Dittmann (Blog: http://www.sascha-dittmann.de, Twitter: @SaschaDittmann): Microsoft HDInsight for .NET Developers. Big Data analysis with JavaScript and C#
  • 2. Large Hadron Collider (CERN, Switzerland), http://public.web.cern.ch/public/en/lhc/Computing-en.html: The LHC particle accelerator produces 15 PB of measurement data per year*
  • 3. Where does Big Data come from? 70% of U.S. smartphone owners regularly shop online via their devices. 44% of users (350M people) access Facebook via mobile devices. 50% of millennials use mobile devices to research products. 60% of U.S. mobile data will be audio and video streaming by 2014. 2/3 of the world's mobile data traffic will be video by 2016. 33% of BI will be consumed via handheld devices by 2013. Gaming consoles are now used an average of 1.5 hrs/wk to connect to the Internet. 80% growth of unstructured data is predicted over the next five years. 1.8 zettabytes of digital data were in use worldwide in 2011, up 30% from 2010. 1 in 4 Facebook users add their location to posts (2B/month). 500M Tweets are hosted on Twitter each day. 38% of people recommend a brand they "like" or follow on a social network. Brands get 100M Facebook "likes" per day. Quadrant labels: Big Data, Social, Mobility, Cloud
  • 4. Big Data scenarios: web app optimization, smart meter monitoring, equipment monitoring, advertising analysis, life sciences research, fraud detection, healthcare outcomes, weather forecasting, natural resource exploration, social network analysis, churn analysis, traffic flow optimization, IT infrastructure optimization, legal discovery
  • 5. Big Data is sexy, http://hbr.org/
  • 6. Apache Hadoop Ecosystem: MapReduce (Job Scheduling/Execution System); HDFS (Hadoop Distributed File System); HBase (Column DB); Pig (Data Flow); Hive (Warehouse and Data Access); Oozie (Workflow); Sqoop; Traditional BI Tools; HBase / Cassandra (Columnar NoSQL Databases); Avro (Serialization); ZooKeeper (Coordination); Apache Mahout; Cascading (programming model); Flume. Hadoop = MapReduce + HDFS
  • 7. Microsoft HDInsight: MapReduce (Job Scheduling/Execution System); HDFS (Hadoop Distributed File System); HBase (Column DB); Pig (Data Flow); Hive (Warehouse and Data Access); Oozie (Workflow); Sqoop; Traditional BI Tools; HBase / Cassandra (Columnar NoSQL Databases); Avro (Serialization); ZooKeeper (Coordination); Apache Mahout; Cascading (programming model); Flume. Hadoop = MapReduce + HDFS. Microsoft additions: Windows, System Center, Active Directory, Visual Studio
  • 8. Hadoop Distributed File System (HDFS): boot process, fault tolerance, user request
  • 9. Hadoop Distributed File System (HDFS): boot process, fault tolerance, user request
  • 11. Hadoop Distributed File System (HDFS): Portable Operating System Interface (POSIX); replication across multiple data nodes
      js> #ls /user/Sascha/input/ncdc
      Found 9 items
      drwxr-xr-x - Sascha supergroup   0 2013-04-24 13:09 /user/Sascha/input/ncdc/all
      drwxr-xr-x - Sascha supergroup   0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2
      drwxr-xr-x - Sascha supergroup   0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata
      drwxr-xr-x - Sascha supergroup   0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro
      drwxr-xr-x - Sascha supergroup   0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
      -rw-r--r-- 3 Sascha supergroup 529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
      -rw-r--r-- 3 Sascha supergroup 168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
  • 13. Map/Reduce using measurement data as an example. Highlighted fields: year, air temperature
      0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
      0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
      0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
      0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
      0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
  • 14. Map/Reduce using measurement data as an example. Highlighted field: measurement quality
      0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
      0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
      0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
      0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
      0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
  • 16. Map/Reduce with the Combine method. Each of three DataNodes runs Map, Combine, Sort, Shuffle; a single Reduce step follows
      Input records (truncated): 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300
      Map output: 1949,0 1950,22 1950,55 1952,-11 1950,33
      After Combine: 1949,0 1950,55 1952,-11 1950,33
      After Sort/Shuffle: 1949,0 1950,[33,55] 1952,-11
      Reduce output: 1949,0 1950,55 1952,-11
  • 17. Map/Reduce using measurement data as an example
  • 18. Counting words with JavaScript (Map)
  • 19. Counting words with JavaScript (Reduce)
  • 21. Refining with Pig Latin
      pig.from("/user/Sascha/input/texte")
         .mapReduce("/user/…/WordCount.js", "Woerter, Anzahl:long")
         .orderBy("Anzahl DESC")
         .take(15)
         .to("/user/Sascha/output/Top15Woerter")
  • 23. Counting words with C# (Map - Classic)
  • 24. Counting words with C# (Reduce - Classic)
  • 26. .NET Job Submission Framework (Map)
  • 27. .NET Job Submission Framework (Reduce)
  • 28. Creating an external Hive table
      CREATE EXTERNAL TABLE twitter_raw (
        tweet_json STRING
      )
      COMMENT 'Twitter Sample Data'
      ROW FORMAT DELIMITED LINES TERMINATED BY '10'
      STORED AS TEXTFILE
      LOCATION '/example/twitterdata';
  • 29. Twitter JSON
      {
        "possibly_sensitive_editable":true,
        "place":null,
        "text":"Pre - #ConvCloud chat insights. \" #Cloud Security, are we missing the point?\" from @christianve http://t.co/Smo0CPvb #HP #cloudsource",
        "id_str":"223418953114984448",
        "favorited":false,
        "possibly_sensitive":false,
        "created_at":"Thu Jul 12 14:10:04 +0000 2012",
        "retweeted":false,
        "retweet_count":0,
        "user":{
          "is_translator":false,
          "profile_use_background_image":true,
          "profile_image_url_https":"https://si0.twimg.com/profile_images/640456324/Paul_Calento_normal.jpg",
          "id_str":"103006513",
          "profile_text_color":"333333",
          "statuses_count":5984,
          "following":null,
          "followers_count":744,
          "default_profile_image":false,
          "profile_link_color":"FF3300",
        },
        …..
      }
  • 30. Parsing JSON in Hive
      FROM twitter_raw
      INSERT OVERWRITE TABLE twitter_temp
      SELECT
        get_json_object(tweet_json, '$.created_at'),
        substr(get_json_object(tweet_json, '$.created_at'),9,2),
        substr(get_json_object(tweet_json, '$.created_at'),12,8),
        get_json_object(tweet_json, '$.in_reply_to_user_id_str'),
        get_json_object(tweet_json, '$.text'),
        get_json_object(tweet_json, '$.contributors'),
        get_json_object(tweet_json, '$.retweeted'),
        get_json_object(tweet_json, '$.truncated'),
        get_json_object(tweet_json, '$.favorited'),
        cast(get_json_object(tweet_json, '$.retweet_count') as int),
        /* … */
        get_json_object(tweet_json, '$.user.profile_image_url_https'),
        cast(get_json_object(tweet_json, '$.user.followers_count') as int),
        get_json_object(tweet_json, '$.user.location'),
        get_json_object(tweet_json, '$.user.time_zone'),
        get_json_object(tweet_json, '$.user.created_at');
  • 31. Hive
  • 32. RDBMS vs. Hadoop
                        | RDBMS                 | Hadoop
      Volume            | Gigabytes             | Petabytes
      Processing        | Ad hoc and batch      | Batch
      Updates           | Many reads and writes | Write once, read many
      Schema            | Static schema         | Dynamic schema
      Data integrity    | High                  | Low
      Scaling behavior  | Non-linear            | Linear
  • 33. Polybase / SQL Server PDW
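
Slides 13 to 16 walk through a maximum-temperature job over fixed-width weather records. The following is a minimal, self-contained JavaScript sketch of that flow (map, combine per node, then reduce), not the code from the talk. The field offsets are read off the sample records on slide 13. Treating 9999 as a missing reading and accepting only the quality codes 0, 1, 4, 5, and 9 are assumptions, and `mapRecord`, `maxByKey`, and `runJob` are hypothetical helper names.

```javascript
// Sketch only: simulates map -> combine -> shuffle -> reduce in memory.
// Offsets come from the sample records: year at 15-18, signed temperature
// at 87-91, quality code at 92. The missing-value marker (9999) and the
// quality filter are assumptions.

// Map: emit [year, temperature] for every usable record, nothing otherwise.
function mapRecord(line) {
  const year = line.substring(15, 19);
  const temperature = parseInt(line.substring(87, 92), 10); // "+0022" -> 22
  const quality = line.charAt(92);
  if (Number.isNaN(temperature) || temperature === 9999) return [];
  if (!/^[01459]$/.test(quality)) return [];
  return [[year, temperature]];
}

// Combine and Reduce share one function here: keep the maximum per key.
// Reusing it is safe because max is associative and commutative.
function maxByKey(pairs) {
  const best = new Map();
  for (const [key, value] of pairs) {
    if (!best.has(key) || value > best.get(key)) best.set(key, value);
  }
  return [...best.entries()];
}

// Each inner array stands in for the records stored on one data node.
function runJob(splits) {
  const combined = splits.map((lines) => maxByKey(lines.flatMap(mapRecord)));
  const shuffled = combined.flat(); // partial maxima arrive at the reducer
  return maxByKey(shuffled).sort(([a], [b]) => a.localeCompare(b));
}

// The five sample records from slide 13, spread over two simulated nodes.
const splits = [
  [
    "0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999",
    "0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999",
  ],
  [
    "0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999",
    "0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999",
    "0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999",
  ],
];

console.log(runJob(splits)); // [ [ '1949', 111 ], [ '1950', 22 ] ]
```

The combiner only shrinks the data each node sends across the network. The reducer must still produce the same result when no combiner runs at all.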
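
Slides 18 and 19 present word counting in JavaScript, but this transcript preserves only the slide titles. As a stand-in, here is a minimal sketch of a word-count map and reduce pair. The `map(key, value, context)` and `reduce(key, values, context)` shape, with `context.write`, `values.hasNext()`, and `values.next()`, is an assumption about what the HDInsight JavaScript layer expects. `runLocally` is a hypothetical in-memory harness so the sketch runs under Node.js without a cluster.

```javascript
// Sketch only: the function signatures are an assumption about the HDInsight
// JavaScript MapReduce contract; runLocally is a hypothetical local stand-in.

// Map: split a line into words and emit (word, 1) for each one.
const map = function (key, value, context) {
  for (const word of value.toLowerCase().split(/[^\p{L}]+/u)) {
    if (word !== "") context.write(word, 1);
  }
};

// Reduce: sum all counts that were emitted for one word.
const reduce = function (key, values, context) {
  let sum = 0;
  while (values.hasNext()) {
    sum += Number(values.next());
  }
  context.write(key, sum);
};

// Hypothetical harness: groups map output by key, then calls reduce per key.
function runLocally(lines) {
  const groups = new Map();
  lines.forEach((line, offset) => {
    map(offset, line, {
      write(word, count) {
        if (!groups.has(word)) groups.set(word, []);
        groups.get(word).push(count);
      },
    });
  });
  const results = [];
  for (const [word, counts] of groups) {
    let index = 0;
    const iterator = {
      hasNext: () => index < counts.length,
      next: () => counts[index++],
    };
    reduce(word, iterator, { write: (k, v) => results.push([k, v]) });
  }
  return results;
}

// Mirrors the Pig chain on slide 21: order by count descending, keep the top.
const topWords = runLocally(["Big Data ist sexy", "Big Data mit Hadoop"])
  .sort((a, b) => b[1] - a[1])
  .slice(0, 2);
console.log(topWords); // [ [ 'big', 2 ], [ 'data', 2 ] ]
```

The two result columns correspond to the schema string `"Woerter, Anzahl:long"` that slide 21 passes to `mapReduce`.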
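
The Hive query on slide 30 pulls fields out of each tweet with `get_json_object` and slices `created_at` with `substr`. The sketch below reproduces those extractions in plain JavaScript to show what each expression yields for the sample timestamp on slide 29. `hiveSubstr` and `getPath` are hypothetical helpers. `hiveSubstr` assumes that Hive's `substr` takes a 1-based start position, and `getPath` covers only simple dotted paths like the ones in the query, not the full path syntax.

```javascript
// Sketch only: plain-JavaScript equivalents of the extractions on slide 30.

// 1-based start plus length, mapped onto JavaScript's 0-based slice.
function hiveSubstr(text, start, length) {
  return text.slice(start - 1, start - 1 + length);
}

// Resolve a path such as "$.user.followers_count".
// Returns null when a step is missing. Unlike Hive, which returns strings,
// this helper returns the parsed JSON value.
function getPath(json, path) {
  let node = JSON.parse(json);
  for (const key of path.replace(/^\$\.?/, "").split(".")) {
    if (key === "") continue;
    if (node === null || typeof node !== "object" || !(key in node)) return null;
    node = node[key];
  }
  return node;
}

// A few fields from the sample tweet on slide 29.
const tweetJson = JSON.stringify({
  created_at: "Thu Jul 12 14:10:04 +0000 2012",
  retweet_count: 0,
  user: { followers_count: 744 },
});

const createdAt = getPath(tweetJson, "$.created_at");
console.log(hiveSubstr(createdAt, 9, 2));                   // 12 (day of month)
console.log(hiveSubstr(createdAt, 12, 8));                  // 14:10:04 (time)
console.log(getPath(tweetJson, "$.user.followers_count")); // 744
console.log(getPath(tweetJson, "$.user.location"));        // null (field absent)
```

The second and third selected columns in the query are therefore the day of the month and the time of day of each tweet.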