PASS Camp 2012 - Big Data with Microsoft (Part 1)
- 1. PASS Camp 2012
Big Data with Microsoft (Part 1)
Software Developer / Solution Architect
Twitter: @SaschaDittmann
Blog: http://www.sascha-dittmann.de
- 2. What could this be?
180,000,000,000,000,000,000
1,800,000,000,000,000,000,000
- 3. Global data volume
180,000,000,000,000,000,000
= 0.18 ZB (zettabytes), as of 2006
1,800,000,000,000,000,000,000
= 1.8 ZB (zettabytes), as of 2011
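As a quick sanity check of the two figures above, the conversion can be sketched as follows. It assumes the decimal (SI) definition 1 ZB = 10^21 bytes; the name `toZettabytes` is illustrative and does not come from the slides.

```javascript
// Assumption: SI (decimal) definition, 1 zettabyte = 10^21 bytes.
const BYTES_PER_ZETTABYTE = 1e21;

// Convert an exact byte count (BigInt) to zettabytes.
function toZettabytes(bytes) {
  return Number(bytes) / BYTES_PER_ZETTABYTE;
}

console.log(toZettabytes(180_000_000_000_000_000_000n)); // 0.18 (2006)
console.log(toZettabytes(1_800_000_000_000_000_000_000n)); // 1.8 (2011)
```

Read this way, the two numbers differ by one order of magnitude within five years.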
- 4. Scaling
Vertical scaling
Horizontal scaling
- 5. Apache Hadoop Ecosystem
Traditional BI Tools
Oozie (Workflow)
HBase / Cassandra (Columnar NoSQL Databases)
Hive (Warehouse and Data Access)
Pig (Data Flow)
Cascading (programming model)
Apache Mahout
Flume
Sqoop
Zookeeper (Coordination)
Avro (Serialization)
HBase (Column DB)
MapReduce (Job Scheduling/Execution System)
HDFS (Hadoop Distributed File System)
Hadoop = MapReduce + HDFS
- 6. Apache Hadoop Ecosystem
Microsoft components: Visual Studio, Active Directory, System Center, Windows
Traditional BI Tools
Oozie (Workflow)
HBase / Cassandra (Columnar NoSQL Databases)
Hive (Warehouse and Data Access)
Pig (Data Flow)
Cascading (programming model)
Apache Mahout
Flume
Sqoop
Zookeeper (Coordination)
Avro (Serialization)
HBase (Column DB)
MapReduce (Job Scheduling/Execution System)
HDFS (Hadoop Distributed File System)
Hadoop = MapReduce + HDFS
- 10. Hadoop Distributed File System
Portable Operating System Interface (POSIX)
Replication across multiple data nodes
js> #ls input/ncdc
Found 9 items
drwxr-xr-x - Sascha supergroup 0 2012-04-24 13:01 /user/Sascha/input/ncdc/_distcp_logs_g0dedn
drwxr-xr-x - Sascha supergroup 0 2012-04-24 12:04 /user/Sascha/input/ncdc/_distcp_logs_ofj0u6
drwxr-xr-x - Sascha supergroup 0 2012-04-24 13:09 /user/Sascha/input/ncdc/all
drwxr-xr-x - Sascha supergroup 0 2012-04-24 13:01 /user/Sascha/input/ncdc/all2
drwxr-xr-x - Sascha supergroup 0 2012-04-23 13:06 /user/Sascha/input/ncdc/metadata
drwxr-xr-x - Sascha supergroup 0 2012-04-23 13:06 /user/Sascha/input/ncdc/micro
drwxr-xr-x - Sascha supergroup 0 2012-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
-rw-r--r-- 3 Sascha supergroup 529 2012-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
-rw-r--r-- 3 Sascha supergroup 168 2012-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
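The columns of the listing above read, from left to right: permissions, replication factor, owner, group, size in bytes, modification date and time, and path. Directories show `-` in the second column because only files are replicated; the two files are stored with a replication factor of 3. A minimal sketch of a parser for one such line follows; `parseListingLine` is a hypothetical helper written for this illustration, not part of Hadoop.

```javascript
// Hypothetical helper: split one line of HDFS `ls` output into named fields.
function parseListingLine(line) {
  const [permissions, replication, owner, group, size, date, time, ...pathParts] =
    line.trim().split(/\s+/);
  return {
    isDirectory: permissions.startsWith("d"),
    permissions,
    // Directories print "-" here; only files carry a replication factor.
    replication: replication === "-" ? null : Number(replication),
    owner,
    group,
    sizeBytes: Number(size),
    modified: `${date} ${time}`,
    // Re-join the remainder in case the path contains single spaces.
    path: pathParts.join(" "),
  };
}

const file = parseListingLine(
  "-rw-r--r-- 3 Sascha supergroup 529 2012-04-23 13:06 /user/Sascha/input/ncdc/sample.txt"
);
console.log(file.replication); // 3
console.log(file.sizeBytes); // 529
```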
- 11. Map / Reduce
Input records (spread across three DataNodes):
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
Map output (three Map tasks):
1949,0
1952,-11
1950,22
1950,55
1950,33
Sort / Shuffle output:
1949,0
1950,[22,33,55]
1952,-11
Reduce output:
1949,0
1950,55
1952,-11
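The data flow on this slide can be imitated in a few lines of plain JavaScript. This is only an in-memory sketch of the shuffle and reduce steps, not Hadoop code. It starts from the map output shown on the slide, because the truncated input records do not include the temperature field. The names `shuffle` and `reduceMax` are illustrative.

```javascript
// Map output from the slide: [year, temperature] pairs.
const mapOutput = [
  [1949, 0],
  [1952, -11],
  [1950, 22],
  [1950, 55],
  [1950, 33],
];

// Sort/shuffle: group all values by key and order the keys.
// Values are sorted too, only so the result matches the slide.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()]
    .sort(([a], [b]) => a - b)
    .map(([key, values]) => [key, values.sort((x, y) => x - y)]);
}

// Reduce: keep the maximum temperature per year.
function reduceMax(grouped) {
  return grouped.map(([key, values]) => [key, Math.max(...values)]);
}

const grouped = shuffle(mapOutput);
console.log(JSON.stringify(grouped)); // [[1949,[0]],[1950,[22,33,55]],[1952,[-11]]]
console.log(JSON.stringify(reduceMax(grouped))); // [[1949,0],[1950,55],[1952,-11]]
```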
- 12. Combine method
Input records (spread across three DataNodes):
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
Map output (three Map tasks):
1949,0
1952,-11
1950,22
1950,55
1950,33
Combine output (one Combine step per Map task):
1949,0
1952,-11
1950,55
1950,33
Sort / Shuffle output:
1949,0
1950,[33,55]
1952,-11
Reduce output:
1949,0
1950,55
1952,-11
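A combiner applies the reduce logic locally to each mapper's output before the shuffle, so less data has to cross the network. This is safe for the maximum because max is associative and commutative. The sketch below is plain in-memory JavaScript, not Hadoop code. How the five map outputs are split across the three nodes is an assumption made for illustration, since the slide does not spell it out, and the function names are illustrative.

```javascript
// Assumed split of the map output across three nodes (not stated on the slide).
const perNodeMapOutput = [
  [[1949, 0], [1950, 22], [1950, 55]],
  [[1950, 33]],
  [[1952, -11]],
];

// Combine: per node, keep only the local maximum for each year.
function combineMax(pairs) {
  const best = new Map();
  for (const [year, temp] of pairs) {
    if (!best.has(year) || temp > best.get(year)) best.set(year, temp);
  }
  return [...best.entries()];
}

// Shuffle + reduce: merge the combined outputs and take the global maximum.
function reduceCombined(nodeOutputs) {
  return combineMax(nodeOutputs.flat()).sort(([a], [b]) => a - b);
}

const combined = perNodeMapOutput.map(combineMax);
console.log(JSON.stringify(combined));
// [[[1949,0],[1950,55]],[[1950,33]],[[1952,-11]]]
console.log(JSON.stringify(reduceCombined(combined)));
// [[1949,0],[1950,55],[1952,-11]]
```

With this split, four pairs instead of five reach the shuffle, and the final result is unchanged.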
- 13. RDBMS vs. Hadoop
Aspect | RDBMS | Hadoop
Data volume | Gigabytes | Petabytes
Access | Interactive and batch | Batch
Read / write access | Many reads and writes | Write once, read many times
Data structure | Static schema | Dynamic schema
Data integrity | High | Low
Scaling behavior | Non-linear | Linear
- 14. Demos
Hadoop environment
HDFS
Map/Reduce via JavaScript
Data streaming with C#
Power Pivot
- 15. Pig Latin
pig
.from("/user/Sascha/input/texte")
.mapReduce("/user/…/WordCount.js"
, "Woerter, Anzahl:long")
.orderBy("Anzahl DESC")
.take(15)
.to("/user/Sascha/output/Top15Woerter")
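The slide does not show the contents of `WordCount.js`. As a rough guess at what such a script could contain, the sketch below assumes a `map(key, value, context)` / `reduce(key, values, context)` convention with `context.write`, `values.hasNext()`, and `values.next()`; treat those signatures as assumptions rather than documented API. The small `runLocally` harness is hypothetical and exists only so the sketch can be tried in Node.js. It also imitates the `orderBy("Anzahl DESC")` and `take(15)` steps of the query above.

```javascript
// Assumed map signature: emit (word, 1) for every word in a line of text.
function map(key, value, context) {
  const words = value.toLowerCase().split(/[^a-zäöüß]+/);
  for (const word of words) {
    if (word !== "") context.write(word, 1);
  }
}

// Assumed reduce signature: sum the counts collected for one word.
function reduce(key, values, context) {
  let sum = 0;
  while (values.hasNext()) sum += Number(values.next());
  context.write(key, sum);
}

// Hypothetical local harness standing in for the cluster (not a Hadoop API).
function runLocally(lines, limit) {
  const groups = new Map();
  lines.forEach((line, i) =>
    map(i, line, {
      write: (k, v) => groups.set(k, [...(groups.get(k) || []), v]),
    })
  );
  const rows = [];
  for (const [word, counts] of groups) {
    let i = 0;
    const values = { hasNext: () => i < counts.length, next: () => counts[i++] };
    reduce(word, values, { write: (k, v) => rows.push({ Woerter: k, Anzahl: v }) });
  }
  // Equivalent of orderBy("Anzahl DESC") followed by take(limit).
  return rows.sort((a, b) => b.Anzahl - a.Anzahl).slice(0, limit);
}

console.log(runLocally(["Big Data mit Hadoop", "Big Data mit Pig"], 15));
```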