Big Data & Hadoop
Semantic Web Meetup

Jean-Pierre König
3 October 2013
Transcript of "Semantic web meetup 14.november 2013"

  1. Big Data & Hadoop. Semantic Web Meetup. Jean-Pierre König, 3 October 2013
  2. COMPANY PROFILE
  3. WE ARE HERE: From its base in Kreuzlingen, Switzerland, YMC has served well-known national and international clients since 2001.
  4. WE WORK WITH: Customers
  5. WE WORK WITH: Partners
  6. WE CREATE
     • Hosting & Support
     • WEB SOLUTIONS: web strategies; social media applications (e.g. corporate blogs, wikis, Facebook apps); shop systems, websites, intranets; custom web solutions
     • MOBILE APPLICATIONS: mobile strategies; apps for tablets and smartphones (iPhone, Android); integration of social networks such as Facebook and Twitter; geolocation for location-specific services
     • BIG DATA ANALYTICS: recommender systems (e.g. for apps, web shops, websites and intranets); tailor-made web analytics systems (e.g. with real-time metrics and social network effects); predictive models (e.g. for the interests of app users); training (Apache Hadoop); integrated search systems (e.g. also for unstructured data)
  7. WHAT IS BIG DATA
  8. WHAT IS BIG DATA
     • More general
       • When data sets become so large and complex that they become difficult to process, including capture, curation, storage, search, sharing, transfer, analysis, and visualization
       • It is difficult to work with using most RDBMS, statistics and visualization systems
       • It requires massively parallel software running on tens, hundreds, or even thousands of servers
     • The 3 V's by Gartner
       • Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. (2012)
  9. WHAT DRIVES BIG DATA
     • Human-generated data
       • Documents, transaction data, CRM, social media: your working life is devoted to looking at screens and typing more data into some system.
     • Sensor-generated data
       • The trend is that a large part of the physical world around us will eventually be online: the Internet of Things.
     • Machine-generated data will quickly top human-generated data
  10. DRIVERS: BUSINESS DRIVES
      Fraud protection; Risk management; Environment Safety; Increase Revenue; Risk Prevention; 360° Customer Experience Management; Digital Security; Social Media Analysis; Infrastructure Observation; (Mass) Personalization; Recommendation Engines; Data as a Service; Research; Improve Decision-Making; Data Aggregation; Sampling; Web Archives; Predictive Analytics; Data Pre-processing; Video, Audio & Image Processing; Infrastructure Management
  11. THE EMERGING SOLUTIONS
      • NoSQL* movement
        • NoSQL databases are finding significant and growing industry use in big data and real-time web applications.
      • Hadoop and its ecosystem
        • Enterprise-grade solutions, consulting, support
        • Top 3 vendors: Cloudera, Hortonworks, MapR
        • Adoption throughout the software industry, e.g. IBM BigInsights, Microsoft HDInsight, Oracle Big Data Appliance, EMC/Spring/VMWare Pivotal HD, HP HAVEn, Intel Distribution, Dell w/Cloudera
      * Also referred to as "Not only SQL"
  12. HADOOP IN A NUTSHELL
  13. WHAT IS HADOOP
      • "An open-source implementation of frameworks for reliable, scalable, distributed computing and data storage" (official Hadoop website)
      • "A reliable shared storage and analysis system" (O'Reilly: Hadoop – The Definitive Guide)
      • "A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment" (Margaret Rouse)
      • "A complete, open-source ecosystem for capturing, organizing, storing, searching, sharing, analyzing, visualizing, and ..." (Jack Norris)
  14. A BRIEF HISTORY OF HADOOP
      • In 2002 Doug Cutting* started Nutch, an open-source web search engine
      • Fortunately, Google published papers that
        • described the architecture of its distributed filesystem, called GFS (2003)
        • introduced MapReduce (2004)
      • In 2005 Nutch released a new version with NDFS and MapReduce; in 2006 these parts moved out of Nutch to form an independent subproject called Hadoop
      • Cutting joined Yahoo! to build and run Hadoop at web scale
      • In 2008 Hadoop became a top-level Apache project and was used at Yahoo! (10k cores), Last.fm, Facebook and the New York Times
      * Doug Cutting is also the creator of Apache Lucene
  15. HADOOP IN A NUTSHELL
      • HDFS
        • A distributed file system for storage
        • Is highly fault-tolerant and is designed to be deployed on low-cost/commodity hardware
        • 1 master called NameNode, many DataNodes (10+)
      • MapReduce
        • A batch query processor to run an ad hoc query against your whole dataset and get the results in a reasonable time
        • 1 master called JobTracker, many TaskTrackers (10+)
  16. HADOOP FACT-SHEET
      • HDFS/distributed storage
        • Economical: commodity hardware
        • Scalable: rebalances data on new nodes
        • Fault tolerant: detects faults and auto-recovers
        • Reliable: maintains multiple copies of data
        • High throughput: because data is distributed
      • MapReduce/distributed processing
        • Economical: commodity hardware
        • Scalable: add nodes to increase parallelism
        • Fault tolerant: auto-recovers job failures
        • Data locality: process where the data resides
  17. HADOOP PRINCIPLES
      • Schema on read
      • Data locality
      • No shared memory or disks
      • Scales out to thousands of servers
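
To make "schema on read" concrete: raw records are stored untouched, and a structure is imposed only by the code that reads them. The following is a minimal Python sketch of that idea, not Hadoop code; the tab-separated log format, the `Event` fields and the `read_with_schema` helper are assumptions made up for illustration.

```python
from collections import Counter
from typing import Iterator, List, NamedTuple

# Raw lines are stored as-is; nothing validates them at write time.
RAW_LINES: List[str] = [
    "2013-10-03\tlogin\talice",
    "2013-10-03\tpurchase\tbob",
    "corrupted line without tabs",
    "2013-10-04\tlogin\tbob",
]


class Event(NamedTuple):
    """Hypothetical schema that one particular reader chooses to apply."""
    date: str
    action: str
    user: str


def read_with_schema(lines: List[str]) -> Iterator[Event]:
    """Apply the schema while reading; skip records that do not fit it."""
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # the raw data stays stored, this reader just ignores it
        yield Event(*parts)


logins_per_user = Counter(
    event.user for event in read_with_schema(RAW_LINES) if event.action == "login"
)
print(dict(logins_per_user))  # {'alice': 1, 'bob': 1}
```

A different job could read the same raw lines with a different schema, which is the point of deferring the schema to read time.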
  18. HADOOP: HADOOP SYSTEM COMPONENTS
      • HDFS: masters NameNode and Secondary NameNode; slaves (many of them) DataNode
      • MapReduce: master JobTracker; slaves (many of them) TaskTracker
  19. WRITING FILES ON HDFS*
      • Client: "Hey, I want to write blocks A, B and C of my File.txt."
      • NameNode: "OK, write to DataNodes 1, 5 and 9."
      • [Diagram: Client, NameNode and DataNodes 1, 5, 6, 9, ... N on Rack 1 and Rack 2, holding blocks A, B, C and their replicas A`, B`, C`]
      * Replication factor of 3
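
The write path on this slide can be sketched as a toy NameNode that hands out three target DataNodes per block. This is an illustrative Python model only: the rack layout, the `ToyNameNode` class and its placement rule (one replica on one rack, two on another) are assumptions, not HDFS's actual placement code, and in real HDFS the client streams a block to the first DataNode, which forwards it along a pipeline rather than the client writing every copy itself.

```python
import itertools
from typing import Dict, List

REPLICATION_FACTOR = 3  # matches the "replication factor of 3" note on the slide

# Hypothetical cluster layout: rack name -> DataNode ids.
RACKS: Dict[str, List[int]] = {
    "rack1": [1, 5, 6],
    "rack2": [9, 10, 11],
}


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Keeps only metadata: which DataNodes hold which block."""

    def __init__(self, racks: Dict[str, List[int]]) -> None:
        self.racks = racks
        self.block_map: Dict[str, List[int]] = {}
        self._counter = itertools.count()

    def allocate(self, block_id: str) -> List[int]:
        """Pick one node on one rack and the remaining replicas on another rack."""
        n = next(self._counter)
        names = sorted(self.racks)
        first_rack = self.racks[names[n % len(names)]]
        other_rack = self.racks[names[(n + 1) % len(names)]]
        targets = [first_rack[n % len(first_rack)]]
        start = n % len(other_rack)
        for k in range(REPLICATION_FACTOR - 1):
            targets.append(other_rack[(start + k) % len(other_rack)])
        self.block_map[block_id] = targets
        return targets


def write_file(name: str, data: bytes, namenode: ToyNameNode,
               datanodes: Dict[int, Dict[str, bytes]], block_size: int = 4) -> None:
    """Ask the NameNode where each block goes, then store it on those nodes."""
    for index, block in enumerate(split_into_blocks(data, block_size)):
        block_id = f"{name}#{index}"
        for node in namenode.allocate(block_id):
            datanodes.setdefault(node, {})[block_id] = block


namenode = ToyNameNode(RACKS)
datanodes: Dict[int, Dict[str, bytes]] = {}
write_file("File.txt", b"AAAABBBBCCCC", namenode, datanodes)
for block_id, nodes in namenode.block_map.items():
    print(block_id, "->", nodes)
```

The NameNode never touches the block contents; it only records locations, which is what lets a later read ask it where each block lives.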
  20. READING FILES FROM HDFS
      • Client: "Tell me the block locations of File.txt."
      • NameNode: A → DataNode 1, 5, 6; B → DataNode 1, 5, N; C → DataNode 5, 9, 6
      • [Diagram: Client, NameNode and DataNodes 1, 5, 6, 9, ... N on Rack 1 and Rack 2, holding blocks A, B, C and their replicas A`, B`, C`]
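
The read path can be sketched the same way: the client gets the block locations from the NameNode, then fetches each block from the first replica that is still reachable. The location table below is taken from the slide; the block contents, the `build_datanodes` and `read_file` helpers and the `dead` parameter are hypothetical and only serve the illustration.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

NodeId = Union[int, str]

# Block locations as answered by the NameNode on the slide ("N" is the slide's last node).
BLOCK_LOCATIONS: Dict[str, List[Tuple[str, List[NodeId]]]] = {
    "File.txt": [("A", [1, 5, 6]), ("B", [1, 5, "N"]), ("C", [5, 9, 6])],
}

# Hypothetical block contents, made up for this example.
BLOCK_DATA: Dict[str, bytes] = {"A": b"first ", "B": b"second ", "C": b"third"}


def build_datanodes(locations, contents) -> Dict[NodeId, Dict[str, bytes]]:
    """Place a copy of every block on each DataNode listed for it."""
    nodes: Dict[NodeId, Dict[str, bytes]] = {}
    for blocks in locations.values():
        for block_id, replicas in blocks:
            for node in replicas:
                nodes.setdefault(node, {})[block_id] = contents[block_id]
    return nodes


def read_file(name: str, locations, datanodes,
              dead: FrozenSet[NodeId] = frozenset()) -> bytes:
    """Read each block from the first replica whose DataNode is alive."""
    parts = []
    for block_id, replicas in locations[name]:
        for node in replicas:
            if node not in dead:
                parts.append(datanodes[node][block_id])
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(parts)


datanodes = build_datanodes(BLOCK_LOCATIONS, BLOCK_DATA)
print(read_file("File.txt", BLOCK_LOCATIONS, datanodes))
# Losing DataNodes 1 and 5 still leaves one replica of every block.
print(read_file("File.txt", BLOCK_LOCATIONS, datanodes, dead=frozenset({1, 5})))
```

With three replicas per block, the file stays readable until all three holders of some block are gone, at which point the sketch raises an error.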
  21. MAPREDUCE IN A NUTSHELL: Word Count Example
      • Input: "Deer Bear River", "Car Car River", "Deer Car Bear"
      • Split: "Deer Bear River" | "Car Car River" | "Deer Car Bear"
      • Map: (Deer, 1) (Bear, 1) (River, 1) | (Car, 1) (Car, 1) (River, 1) | (Deer, 1) (Car, 1) (Bear, 1)
      • Shuffle: (Bear, 1) (Bear, 1) | (Car, 1) (Car, 1) (Car, 1) | (Deer, 1) (Deer, 1) | (River, 1) (River, 1)
      • Reduce: (Bear, 2) | (Car, 3) | (Deer, 2) | (River, 2)
      • Result: Bear, 2; Car, 3; Deer, 2; River, 2
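
The word count stages from this slide can be traced in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce steps, not the Hadoop MapReduce API; the function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# The three input splits shown on the slide.
SPLITS: List[str] = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every word in one split."""
    return [(word, 1) for word in line.split()]


def shuffle(mapped: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, sorted by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


mapped = [map_phase(split) for split in SPLITS]
grouped = shuffle(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped.items())
print(result)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```

In a cluster, each `map_phase` call would run on the node holding that split and each `reduce_phase` call could run on a different node; only the shuffle moves data between them.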
  22. MAPREDUCE VS. RDBMS
      • RDBMS
        • In a centralized database system, you’ve got one big disk connected to 4 or 8 or 16 big processors.
      • MapReduce
        • In a Hadoop cluster, every server has 2 or 4 or 8 CPUs. You can run your job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. You map the operation out to all of those servers and then you reduce the results back into a single result set.
        • Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out. And the reason you’re able to ask complicated computational questions is because you’ve got all of these processors, working in parallel, harnessed together.
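
The "send the code to the data, then combine the partial results" idea can be sketched with worker threads standing in for servers. This is an analogy in plain Python, not Hadoop; the partitioned numbers and the helper names are assumptions made up for the example, which computes a mean from per-partition sums and counts.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Tuple

# Hypothetical data set, already spread across four "servers".
PARTITIONS: List[List[int]] = [
    [12, 7, 3],
    [25, 1],
    [8, 8, 8, 8],
    [40],
]


def local_sum_and_count(partition: List[int]) -> Tuple[int, int]:
    """Work done on one server, touching only its own piece of the data."""
    return sum(partition), len(partition)


def combine(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Merge two partial results into one."""
    return a[0] + b[0], a[1] + b[1]


# "Map" the same function out to every partition in parallel.
with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    partials = list(pool.map(local_sum_and_count, PARTITIONS))

# "Reduce" the small partial results back into a single answer.
total, count = reduce(combine, partials)
mean = total / count
print(partials)            # [(22, 3), (26, 2), (32, 4), (40, 1)]
print(total, count, mean)  # 120 10 12.0
```

Only the small `(sum, count)` pairs travel back to the caller, never the raw partitions, which is why the approach keeps working as the data grows.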
  23. ECOSYSTEM HADOOP
  24. HADOOP ECOSYSTEM
  25. HADOOP’S DATABASE HBASE*
      • Unlike RDBMS
        • No secondary indexes
        • No transactions
        • De-normalized, schema-less
      • Random read/write access to big data
      • Billions of rows and millions of columns
      • Automatic data sharding
      • Integrates with MapReduce
      * Modeled after Google’s BigTable
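
Two of the HBase properties listed here, schema-less rows and automatic sharding, can be pictured with a toy table that keeps rows sorted by key and splits a shard by key range once it grows too large. This is a conceptual Python sketch, not the HBase client API; the `ToyTable` class, its split threshold and the sample rows are assumptions for illustration (HBase calls such key-range shards "regions" and names columns as `family:qualifier`).

```python
import bisect
from typing import Any, Dict, List


class ToyTable:
    """Rows are sharded by key range; every row may have different columns."""

    def __init__(self, max_rows_per_region: int = 2) -> None:
        self.max_rows = max_rows_per_region
        self.regions: List[Dict[str, Dict[str, Any]]] = [{}]
        self.split_keys: List[str] = []  # region i + 1 starts at split_keys[i]

    def _region_index(self, row_key: str) -> int:
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key: str, column: str, value: Any) -> None:
        """Random write: locate the region by key, then set one cell."""
        idx = self._region_index(row_key)
        region = self.regions[idx]
        region.setdefault(row_key, {})[column] = value
        if len(region) > self.max_rows:
            self._split(idx)

    def get(self, row_key: str) -> Dict[str, Any]:
        """Random read: locate the region by key, then return the row."""
        return self.regions[self._region_index(row_key)].get(row_key, {})

    def _split(self, idx: int) -> None:
        """Split an oversized region in two at its middle row key."""
        region = self.regions[idx]
        keys = sorted(region)
        mid = keys[len(keys) // 2]
        left = {k: v for k, v in region.items() if k < mid}
        right = {k: v for k, v in region.items() if k >= mid}
        self.regions[idx:idx + 1] = [left, right]
        self.split_keys.insert(idx, mid)


table = ToyTable(max_rows_per_region=2)
table.put("user#alice", "info:email", "alice@example.org")
table.put("user#bob", "info:email", "bob@example.org")
table.put("user#bob", "stats:logins", 7)  # rows need not share columns
table.put("user#carol", "info:city", "Kreuzlingen")
table.put("user#dave", "stats:logins", 1)

print(table.split_keys)       # ['user#bob', 'user#carol']
print(table.get("user#bob"))  # {'info:email': 'bob@example.org', 'stats:logins': 7}
```

Lookups are by row key only, which mirrors the "no secondary indexes" point: finding rows by any other attribute would need a full scan or a separately maintained table.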
  26. USE CASES HADOOP
  27. USE CASES: Data Warehousing
      • Complementary ETL process
      • [Diagram labels: File Server, OLTP, CRM, ERP, ...; Logs, Social Media, Sensors, ...; Sqoop, Flume, Java API; Pig, Hive, MapReduce, HDFS; ETL, Data Warehouse, Data Marts, Data Cubes; Analytics, Visualization, Reports]
  28. USE CASES: Data Warehousing
      • Substitutive ETL process
      • [Diagram labels: File Server, OLTP, CRM, ERP, ...; Logs, Social Media, Sensors, ...; Hadoop; Data Warehouse; Analytics, Visualization, Reports]
  29. USE CASES: Data Warehousing
      • (Predictive) Analytics at scale
      • [Diagram labels: File Server, OLTP, CRM, ERP, ...; Logs, Social Media, Sensors, ...; Hadoop; Data Warehouse; Analytics, Visualization, Reports]
  30. USE CASES: Data Warehousing
      • Machine learning, natural language processing, sentiment at scale
      • [Diagram labels: File Server, OLTP, CRM, ERP, ...; Logs, Social Media, Sensors, ...; Hadoop; Data Warehouse; Analytics, ML + NLP*, Visualization, Reports]
      * Personalized recommendations: content, products, services …
  31. THANK YOU!
  32. CONTACT US
      jean-pierre.koenig@ymc.ch, Tel. +41 (0)71 508 24 86, www.ymc.ch, @YMC_Big_Data
      YMC AG, Sonnenstrasse 4, CH-8280 Kreuzlingen, Switzerland
      Photo credits: Slide 03: Matterhorn and Lake by Noel Reynolds; Slide 24: Hadoop Ecosystem by Rishu Shrivastava