Semantic web meetup 14.november 2013

Big Data & Hadoop
Semantic Web Meetup

Jean-Pierre König
3 October 2013

WE ARE HERE
From its base in Kreuzlingen, Switzerland,
YMC has served well-known national and
international clients since 2001.

WE CREATE

Hosting & Support

Web strategies

Social media applications
(e.g. corporate blogs, wikis, Facebook apps, etc.)

Online shops, websites, intranets

Custom-built solutions for the web
WEB
SOLUTIONS

Recommender systems
(e.g. for apps, web shops, websites and intranets)

Mobile strategies

MOBILE
APPLICATIONS

BIG DATA
ANALYTICS

Apps for tablets and smartphones
(iPhone, Android)

Tailor-made web analytics systems
(e.g. with real-time metrics and effects in
social networks)

Integration of social networks such as
Facebook and Twitter

Geolocation for
location-based services

Predictive models
(e.g. for the interests of app users)

Training
(Apache Hadoop)

Integrated search systems
(e.g. also for unstructured data)

WHAT IS BIG DATA
§  More generally
§  Data sets so large and complex that they become difficult
to process, including capture, curation, storage, search,
sharing, transfer, analysis, and visualization
§  They are difficult to work with using most RDBMS, statistics
and visualization systems
§  They require massively parallel software running on tens,
hundreds, or even thousands of servers

§  The 3 V’s by Gartner
§  Big data is high volume, high velocity, and/or high variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization. (2012)

WHAT DRIVES BIG DATA
§  Human-generated data
§  Documents, transaction data, CRM, social media
- your working life is devoted to looking at
screens and typing more data into some system.

§  Sensor-generated data
§  The trend is that a large part of the physical
world around us will eventually be online:
the Internet of Things.

§  Machine-generated data will quickly top
human-generated data

BUSINESS DRIVERS
Fraud protection
Risk management
Environment Safety

Increase Revenue

Risk Prevention

360° Customer Experience Management
Digital Security
Social Media Analysis
Infrastructure Observation
(Mass) Personalization
Recommendation Engines
Data as a Service
Research

Improve Decision-Making

Data Aggregation
Sampling
Web Archives
Predictive Analytics
Data Pre-processing
Video, Audio & Image Processing
Infrastructure Management

THE EMERGING SOLUTIONS
§  NoSQL* Movement
§  NoSQL databases are finding significant and growing
industry use in big data and real-time web applications.

§  Hadoop and its ecosystem
§  Enterprise-grade solutions, consulting, support
§  Top 3 vendors: Cloudera, Hortonworks, MapR
§  Adoption throughout the software industry, e.g. IBM
BigInsights, Microsoft HDInsight, Oracle Big Data
Appliance, EMC/Spring/VMware Pivotal HD, HP HAVEn,
Intel Distribution, Dell w/ Cloudera

* Also referred to as "Not only SQL"

WHAT IS HADOOP
§  An open-source implementation of frameworks
for reliable, scalable, distributed computing and
data storage (official Hadoop website)
§  A reliable shared storage and analysis system
(O'Reilly, Hadoop: The Definitive Guide)

§  A free, Java-based programming framework that
supports the processing of large data sets in a
distributed computing environment (Margaret Rouse)
§  A complete, open-source ecosystem for
capturing, organizing, storing, searching, sharing,
analyzing, visualizing, and ... (Jack Norris)

A BRIEF HISTORY OF HADOOP
§  In 2002 Doug Cutting* started Nutch, an open source web
search engine
§  Fortunately, Google published papers that
   §  described the architecture of its distributed filesystem, called GFS
   (2003)
   §  introduced MapReduce (2004)

§  In 2005 Nutch released a new version with NDFS and
MapReduce; in 2006 these moved out to form an independent
subproject called Hadoop
§  Cutting joined Yahoo! to build and run Hadoop at web scale
§  In 2008 Hadoop became a top-level Apache project and was
used at Yahoo! (10k cores), Last.fm, Facebook and The New York
Times
*Doug Cutting is also the creator of Apache Lucene

HADOOP IN A NUTSHELL
§  HDFS
§  A distributed file system for storage
§  Is highly fault-tolerant and is designed to be
deployed on low-cost/commodity hardware
§  1 Master called NameNode, many DataNodes (10+)

§  MapReduce
§  A batch query processor to run an ad hoc query
against your whole dataset and get the results in a
reasonable time
§  1 Master called JobTracker, many TaskTrackers (10+)

HADOOP FACT-SHEET
HDFS/distributed storage
§  Economical
§  Commodity hardware

§  Scalable
§  Rebalances data on new nodes

§  Fault Tolerant
§  Detects faults and auto recovers

§  Reliable
§  Maintains multiple copies of data

§  High throughput
§  Because data is distributed

MapReduce/distributed processing
§  Economical
§  Commodity hardware

§  Scalable
§  Add nodes to increase parallelism

§  Fault tolerant
§  Auto-recover job failures

§  Data locality
§  Process where the data resides

HADOOP PRINCIPLES
§  Schema on read
§  Data locality
§  No shared memory or disks
§  Scales out to thousands of servers
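
"Schema on read" can be illustrated with a small sketch. It is not Hadoop code: the sample records, the field names and the helper `read_with_schema` are hypothetical and only show the idea that raw data is stored untouched and a schema is applied when the data is read.

```python
import csv
import io

# Raw records are stored exactly as they arrived; nothing is validated on write.
# (Hypothetical sample data for illustration.)
RAW_LINES = [
    "2013-11-14,alice,42",
    "2013-11-14,bob,17",
]


def read_with_schema(lines, schema):
    """Apply a list of (field name, type) pairs to each raw line at read time."""
    reader = csv.reader(io.StringIO("\n".join(lines)))
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# Two readers interpret the same stored bytes with different schemas.
events = list(read_with_schema(RAW_LINES, [("day", str), ("user", str), ("clicks", int)]))
as_text = list(read_with_schema(RAW_LINES, [("day", str), ("user", str), ("clicks", str)]))
```

The stored lines never change; only the schema passed in by the reader decides whether `clicks` is a number or a string.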

HADOOP SYSTEM COMPONENTS

                    HDFS                            MapReduce
Masters             NameNode, Secondary NameNode    JobTracker
Slaves              DataNode                        TaskTracker
(many of them)

WRITING FILES ON HDFS*
Client to NameNode: "Hey, I want to write A, B
and C of my File.txt."
NameNode to Client: "OK, write to DataNodes
1, 5 and 9."

[Diagram: the client writes Block A, Block B and Block C of File.txt; together with their replicas (marked `) the blocks end up on DataNodes 1, 5, 6, 9 ... N, spread across Rack 1 and Rack 2]

* Replication Factor of 3
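
The write path above can be sketched in a few lines. This is a toy model, not the HDFS implementation: the function names, the round-robin placement and the node labels are assumptions for illustration, and real HDFS additionally takes rack topology into account when choosing replica targets.

```python
REPLICATION_FACTOR = 3  # matches the footnote on the slide


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(block_ids, datanodes, replication=REPLICATION_FACTOR):
    """Toy NameNode decision: give every block `replication` distinct DataNodes.

    Simple round-robin placement, chosen only to keep the sketch short.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for index, block_id in enumerate(block_ids):
        placement[block_id] = [
            datanodes[(index + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


# A 300-byte "file" with a 128-byte block size yields three blocks.
blocks = split_into_blocks(b"x" * 300, block_size=128)
placement = place_blocks(["A", "B", "C"], ["dn1", "dn5", "dn6", "dn9", "dnN"])
```

Each block ends up on three different nodes, so losing any single DataNode still leaves two copies of every block.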

READING FILES FROM HDFS
Client to NameNode: "Tell me the block
locations of File.txt."
NameNode to Client:
A → DataNode 1,5,6
B → DataNode 1,5,N
C → DataNode 5,9,6

[Diagram: the client reads Blocks A, B and C directly from DataNodes that hold them; DataNodes 1, 5, 6, 9 ... N are spread across Rack 1 and Rack 2]
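
A minimal sketch of the read path, assuming the block locations listed on the slide. The function `choose_replicas` and the "first live replica wins" rule are hypothetical simplifications; they only illustrate why replication lets a read succeed when a DataNode is down.

```python
# Block locations as answered by the NameNode on the slide.
BLOCK_LOCATIONS = {
    "A": ["DataNode 1", "DataNode 5", "DataNode 6"],
    "B": ["DataNode 1", "DataNode 5", "DataNode N"],
    "C": ["DataNode 5", "DataNode 9", "DataNode 6"],
}


def choose_replicas(locations, unavailable=frozenset()):
    """Pick, for every block, the first listed DataNode that is still reachable."""
    plan = {}
    for block, nodes in locations.items():
        live = [node for node in nodes if node not in unavailable]
        if not live:
            raise RuntimeError(f"no live replica for block {block}")
        plan[block] = live[0]
    return plan


# Even with DataNode 5 down, every block can still be read from another replica.
plan = choose_replicas(BLOCK_LOCATIONS, unavailable={"DataNode 5"})
```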

MAPREDUCE IN A NUTSHELL
Word Count Example

Input
Deer Bear River
Car Car River
Deer Car Bear

Split
1: Deer Bear River
2: Car Car River
3: Deer Car Bear

Map
1: Deer, 1   Bear, 1   River, 1
2: Car, 1   Car, 1   River, 1
3: Deer, 1   Car, 1   Bear, 1

Shuffle
Bear, 1   Bear, 1
Car, 1   Car, 1   Car, 1
Deer, 1   Deer, 1
River, 1   River, 1

Reduce
Bear, 2
Car, 3
Deer, 2
River, 2

Result
Bear, 2
Car, 3
Deer, 2
River, 2
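
The word count example can be reproduced in plain Python. This is a single-process sketch of the map, shuffle and reduce stages, not the Hadoop MapReduce API; the function names are chosen for illustration.

```python
from collections import defaultdict

# The three input splits from the slide.
LINES = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


mapped = [pair for line in LINES for pair in map_phase(line)]
grouped = shuffle(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped.items())
print(result)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```

In a real cluster the map calls run in parallel on the nodes holding the splits, and the shuffle moves data across the network so that each reducer sees all values for its keys.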

MAPREDUCE VS. RDBMS
§  RDBMS
§  In a centralized database system, you’ve got one big disk connected to
4 or 8 or 16 big processors.

§  MapReduce
§  In a Hadoop cluster, every server has 2 or 4 or 8 CPUs. You can run
your job by sending your code to each of the dozens of servers in your
cluster, and each server operates on its own little piece of the data.
Results are then delivered back to you in a unified whole. You map the
operation out to all of those servers and then you reduce the results
back into a single result set.

§  Architecturally, the reason you’re able to deal with lots of data is
because Hadoop spreads it out. And the reason you’re able to
ask complicated computational questions is because you’ve got
all of these processors, working in parallel, harnessed together.

HADOOP’S DATABASE HBASE*
§  Unlike an RDBMS
§  No secondary indexes
§  No transactions
§  De-normalized, schemaless

§  Random read/write access to big data
§  Billions of rows and millions of columns
§  Automatic data sharding
§  Integrates with MapReduce
* Modeled after Google’s BigTable
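
The data model behind these bullets can be sketched as a sorted, sparse map. The class below is a toy illustration, not the HBase client API: `ToyTable`, its methods and the split keys are hypothetical, and the "family:qualifier" column naming is used only to hint at how wide, sparse rows and range-based sharding fit together.

```python
import bisect


class ToyTable:
    """Rows are looked up by key; each row is a sparse dict of column -> value."""

    def __init__(self, split_keys):
        # Region boundaries (hypothetical): keys below the first split key
        # belong to region 0, and so on.
        self.split_keys = sorted(split_keys)
        self.rows = {}

    def put(self, row_key, column, value):
        """Random write: no schema, any row may carry any columns."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        """Random read by row key and column."""
        return self.rows.get(row_key, {}).get(column)

    def region_for(self, row_key):
        """Sharding: the region index is derived from the row key range."""
        return bisect.bisect_right(self.split_keys, row_key)

    def scan(self, start, stop):
        """Range scan over row keys in sorted order, start inclusive, stop exclusive."""
        return [(key, self.rows[key]) for key in sorted(self.rows) if start <= key < stop]


table = ToyTable(split_keys=["g", "p"])
table.put("alice", "info:email", "alice@example.org")
table.put("kim", "info:city", "Kreuzlingen")
table.put("zoe", "stats:logins", 7)
```

Because rows are addressed only by key, there is nothing like a secondary index here: a lookup by `info:city` would require a full scan, which is where MapReduce integration comes in.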

USE CASES
Data Warehousing

§  Complementary ETL process
[Diagram. Sources: File Server, OLTP, CRM, ERP, Logs, Social Media, Sensors, ... Classic path: ETL, Data Warehouse (Data Marts, Data Cubes). Hadoop path: Sqoop, Flume, Java API; Pig, Hive, MapReduce, HDFS. Consumers: Analytics, Visualization, Reports]

USE CASES
Data Warehousing

§  Substitutive ETL process
[Diagram. Sources: File Server, OLTP, CRM, ERP, Logs, Social Media, Sensors, ... Components: Hadoop, Data Warehouse. Consumers: Analytics, Visualization, Reports]

USE CASES
Data Warehousing

§  (Predictive) Analytics at scale
[Diagram. Sources: File Server, OLTP, CRM, ERP, Logs, Social Media, Sensors, ... Components: Hadoop, Data Warehouse. Consumers: Analytics, Visualization, Reports]

USE CASES
Data Warehousing

§  Machine learning, natural language processing, sentiment analysis at scale
[Diagram. Sources: File Server, OLTP, CRM, ERP, Logs, Social Media, Sensors, ... Components: Hadoop, Data Warehouse, ML + NLP*. Consumers: Analytics, Visualization, Reports]

* Personalized recommendations
§  content, products, services …

CONTACT US
jean-pierre.koenig@ymc.ch
Tel. +41 (0)71 508 24 86
www.ymc.ch
@YMC_Big_Data

YMC AG
Sonnenstrasse 4
CH-8280 Kreuzlingen
Switzerland

Photo Credits:
Slide 03: Matterhorn and Lake by Noel Reynolds
Slide 24: Hadoop Ecosystem by Rishu Shrivastava
