Setting up a Big Data Platform at Kelkoo
Data Platform
Fabrice dos Santos
1st of Sep. 2015
Kelkoo DataPlatform / Big Data ?
• “Football is a simple game. Twenty-two men chase a ball for 90 minutes and at the end, the Germans always win.” (Gary Lineker)
• And why do they win?
– Because they use big data!
– The German team partnered with German software giant SAP AG to create a custom match analysis tool that collects and analyzes massive amounts of player performance data.
• Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.
• http://blogs.wsj.com/cio/2014/07/10/germanys-12th-man-at-the-world-cup-big-data/
• https://www.youtube.com/watch?v=JX5NLUViIMc
• http://www.lesechos.fr/idees-debats/cercle/cercle-111048-le-monde-nouveau-du-big-data-1047390.php
Kelkoo DataPlatform transitioning ::: AGENDA & Goal
Flume: data collection and aggregation
HDFS: distributed storage
• NameNode / Datanodes
• HDFS INPUTS: LOGS
• HDFS OUTPUT: REPORTS
Spark on Yarn: distributed processing
• ResourceManager / NodeManager
• Spark applications
Hive / SparkSQL: query data, read and analyse
• GOAL
– Give you the core concepts of the Hadoop platform @ Kelkoo
– Understand the dataflow
– Start getting used to the vocabulary
1/ Kelkoo DataPlatform :: Flume
Flume Agent: data transport
HDFS: data storage
• NameNode / Datanodes
• HDFS INPUTS: LOGS
• HDFS OUTPUT: REPORTS
Spark on Yarn: data analysis and processing
• ResourceManager / NodeManager
• Apache Flume is a distributed, reliable, and available
system for efficiently collecting, aggregating and
moving large amounts of log data from many
different sources to a centralized data store.
Flume :: Core concepts
[Diagram: FLUME AGENT (kelkoo_a1)]
External client (ECS, KLS, etc.) -> [push events] -> Source (rImp, rLead) -> [receive events] -> Channel (cImp, cLead) -> [poll / read events] -> Sink (sImp, sLead) -> [forward events] -> HDFS
• Event: unit of data transported by Flume
– [ Header (timestamp, hostname …) | Body (data) ]
• Client (ECS, KLS): point of origin of events, which delivers them to a Flume agent
• Flume agent (kelkooFlume_a?): a JVM process made of:
– Source: consumes events and hands them over to the channel
– Channel: buffers incoming events until a sink drains them for further transport => reliability
– Sink: removes events from a channel and transmits them to the next agent or to the terminal repository (HDFS sink here)
– The channel periodically writes a backup checkpoint out to disk => recoverability
• HDFS storage: terminal repository
Checkpoint: /opt/kookel/data/kelkooFlume
Flume @ Kelkoo
• Source type
– We use Avro, a data serialization format: compact and fast binary data format
– Other types of sources: memory, exec (tail -f /opt/… )
• Flume = transactional design = each event is treated as a transaction
– Events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. The channel thus keeps a queue of current events until the storage confirmation is received
• Distributed and scalable system: 4 agents at Kelkoo, installed on 2 servers, spread the load
• Channel type: file
– The File Channel is Flume’s persistent channel.
– Writes out all events to disk: no data loss on process or machine shutdown or crash.
– The File Channel ensures that any events committed into the channel are removed from the channel only when a sink takes the events and commits the transaction, even if the machine or agent crashed and was restarted.
– Designed to be highly concurrent and to handle several sources and sinks at the same time.
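The transactional behaviour described above can be illustrated with a small sketch. This is a toy in-memory model, not Flume's API: the ToyChannel class, its methods and the drain function are hypothetical names chosen for illustration. It only shows the rule that events leave the channel after the next hop confirms storage.

```python
from collections import deque


class ToyChannel:
    """In-memory stand-in for a Flume channel (illustrative only, not Flume's API)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_batch(self, size):
        # Peek at up to `size` events without removing them yet.
        return list(self._events)[:size]

    def commit(self, count):
        # Remove events only once the sink has confirmed they were stored.
        for _ in range(count):
            self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, store, batch_size=2):
    """Move one batch to `store`; events stay queued if the store fails."""
    batch = channel.take_batch(batch_size)
    try:
        store(batch)
    except IOError:
        return 0  # rollback: nothing was removed from the channel
    channel.commit(len(batch))
    return len(batch)
```

If the store raises an error, the batch is still in the channel and can be retried later, which is the property the File Channel provides on disk.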
Flume monitoring
• Each agent exposes a JSON servlet
– http://haddop-server:34545/metrics
– Returns JSON output that is easy to process for monitoring purposes, using a simple shell script with the jq tool
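As an alternative to the shell script with jq, the same JSON output could be processed as sketched below. The sample payload is made up for illustration: the component names cImp and rImp come from the agent diagram, while the numbers are assumptions. Flume reports one entry per component, keyed as TYPE.name, with metric values encoded as strings.

```python
import json

# Hypothetical sample of the /metrics output (values are invented).
SAMPLE = """
{
  "CHANNEL.cImp": {"Type": "CHANNEL", "ChannelSize": "120", "ChannelCapacity": "1000"},
  "SOURCE.rImp": {"Type": "SOURCE", "EventAcceptedCount": "5000"}
}
"""


def channel_fill(metrics_json):
    """Return {channel name: fill ratio} from a Flume JSON metrics payload."""
    metrics = json.loads(metrics_json)
    fill = {}
    for component, values in metrics.items():
        if values.get("Type") != "CHANNEL":
            continue
        # Metric values arrive as strings, so convert before dividing.
        size = int(values["ChannelSize"])
        capacity = int(values["ChannelCapacity"])
        fill[component] = size / capacity
    return fill


if __name__ == "__main__":
    print(channel_fill(SAMPLE))
```

A fill ratio that keeps growing suggests that the sink is not draining the channel as fast as the source fills it.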
2 / KelkooDataPlatform :: HDFS
distributed storage
::: HDFS definition
• HDFS is a highly scalable, distributed file system, meant to store large amounts of data
• Based on the design of the Google File System (GoogleFS).
• Appears as a single disk: it abstracts the physical architecture, so we can manipulate files as if we were on a single disk.
• HDFS?
– HADOOP DISTRIBUTED FILESYSTEM
HDFS Daemons Overview
• 2 types of processes:
– Namenode (must always be running):
• Stores the metadata about files and blocks on the filesystem, manages namespaces
– Maps a file name to a set of blocks
– Maps a block to a set of Datanodes
• Redirects clients to the appropriate Datanode for reads and writes
– Datanodes:
• Store the data in the local filesystem (ext4 at Kelkoo)
• Periodically report to the Namenode the list of blocks they host and send it a heartbeat
• Serve data and metadata to clients
• Run on several machines
[Diagram: NameNode (dc1-kdp-prod-hadoop-00) and Standby NameNode (dc1-kdp-prod-hadoop-01) above Datanode 1 (ex: dc1-kdp-prod-hadoop-06), Datanode 2, Datanode 3 … Datanode n]
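The two mappings kept by the Namenode can be pictured with plain dictionaries. This is only a mental model with invented paths, block names and host names, not how the Namenode stores its metadata.

```python
# Toy model of the Namenode's two mappings (all names are illustrative).
file_to_blocks = {
    "/user/kookel/toto.txt": ["blk_1", "blk_2"],
}
block_to_datanodes = {
    "blk_1": ["datanode-1", "datanode-2", "datanode-3"],
    "blk_2": ["datanode-2", "datanode-3", "datanode-4"],
}


def locate(path):
    """Return, for each block of `path` in order, the Datanodes holding it."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]


if __name__ == "__main__":
    for block, nodes in locate("/user/kookel/toto.txt"):
        print(block, "->", ", ".join(nodes))
```

The lookup returns locations only; the data itself is then read directly from the Datanodes.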
HDFS files and blocks example
• The « toto.txt » file is managed by the Namenode and stored by the Datanodes
– The file is split into blocks: Block #1 + Block #2
– When a file is read, the client asks the Namenode which Datanodes hold its blocks
– Blocks are replicated (default is 3): ensures robustness and availability
[Diagram: Namenode (dc1-kdp-prod-hadoop-00) above Datanode 1, Datanode 2, Datanode 3 … Datanode n; copies of blocks B1 and B2 are spread across the Datanodes: SAME BLOCK on multiple machines]
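A rough sketch of splitting and replication follows. It assumes a 128 MB block size (the Hadoop 2 default; the value used at Kelkoo is not stated) and a simple round-robin placement. Real HDFS placement is rack-aware, so place_replicas is a simplification with hypothetical names.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes of the blocks a file of `file_size` bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct Datanodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough Datanodes for the requested replication")
    placement = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(200 * MB)
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

With these assumptions a 200 MB file yields two blocks, and losing any single Datanode still leaves at least two copies of each block.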
HDFS High availability with Hadoop 2 + Zookeeper service
• Namenodes: one active (dc1-kdp-prod-hadoop-00) and one standby (dc1-kdp-prod-hadoop-01); the standby takes over if the active Namenode goes down (avoids a SPOF).
• Zookeeper (3 instances): high availability of processes
– Zookeeper server: keeps a copy of the state of the entire system and persists this information in local log files.
– ZooKeeper Failover Controller (ZKFC): monitors the NameNode and triggers a failover when the active NameNode is unavailable.
• Quorum Journal Manager & JournalNodes: high availability of data
– Instead of storing HDFS edit logs in a single location (NFS), store them in several remote locations => the JournalNodes
– Active Namenode: writes edits to the JournalNodes
– QJM (feature of the Namenode) ensures that we « reach the quorum », i.e. that the journal log is written to the majority of the JournalNodes
– Standby Namenode: reads edits
• On the server: conf written in
[Diagram: 3 Zookeeper instances; one ZKFC per NameNode, labelled "Monitor and maintain active lock" and "Monitor and try to take act" (label truncated in the source); the active NameNode's QJM writes the shared edits to three JournalNodes and the standby NameNode's QJM reads them]
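The quorum rule can be written down in a few lines. This is a sketch of the majority arithmetic only, with hypothetical function names; it does not reproduce the QJM implementation.

```python
def quorum_size(journal_nodes):
    """Smallest number of JournalNodes that forms a majority."""
    return journal_nodes // 2 + 1


def edit_is_durable(acks, journal_nodes):
    """An edit counts as written once a majority has acknowledged it."""
    return acks >= quorum_size(journal_nodes)


def tolerated_failures(journal_nodes):
    """How many JournalNodes can be down while writes still succeed."""
    return journal_nodes - quorum_size(journal_nodes)


if __name__ == "__main__":
    n = 3  # the diagram shows three JournalNodes
    print(quorum_size(n), tolerated_failures(n))
```

With three JournalNodes the quorum is two, so one JournalNode can fail without blocking the active Namenode.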
HDFS User Interface
• Interacting with HDFS using filesystem shell commands (as kookel)
– All commands are described in the Hadoop documentation
– hdfs dfs -<command> <options>
– hdfs dfs -du -s -h /user/kookel/logs/flume/
• Commands for HDFS administration
– hdfs dfsadmin -report -live | grep --color dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net -A1
Name: 10.76.99.60:50010 (dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net)
Hostname: dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net
Decommission Status : Normal
– hdfs dfsadmin -getDatanodeInfo dc1-kdp-prod-hadoop-06.prod.dc1.kelkoo.net:50020
• Command for High Availability administration
– hdfs haadmin -failover nn1 nn2
• Web interface:
– http://dc1-kdp-prod-hadoop-00.prod.dc1.kelkoo.net:50070/dfsclusterhealth.html
– http://dc1-kdp-prod-hadoop-00.prod.dc1.kelkoo.net:50070/dfschealth.html
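For scripted checks, the "key: value" lines printed by hdfs dfsadmin -report could be turned into a dictionary as sketched below. The sample text is the excerpt shown above; parse_datanode_entry is a hypothetical helper, and a full report contains more fields than these three.

```python
SAMPLE_ENTRY = """\
Name: 10.76.99.60:50010 (dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net)
Hostname: dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net
Decommission Status : Normal
"""


def parse_datanode_entry(text):
    """Parse one Datanode entry of the report into a {field: value} dict."""
    entry = {}
    for line in text.strip().splitlines():
        # Split on the first colon only: values such as host:port contain colons.
        key, _, value = line.partition(":")
        entry[key.strip()] = value.strip()
    return entry


if __name__ == "__main__":
    info = parse_datanode_entry(SAMPLE_ENTRY)
    print(info["Hostname"], info["Decommission Status"])
```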
3/ Spark on Yarn
• HDFS = distributed storage
• Spark = distributed processing
• Apache Spark is an open-source cluster computing framework for data analytics. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS)
Hadoop Yarn Cluster :: overview
• YARN? (Yet Another Resource Negotiator)
• 2 types of processes:
– ResourceManager
• Arbitrates the available cluster resources
– Helps manage the distributed applications running on the YARN system
– Orchestrates the division of resources (compute, memory, bandwidth, etc.) among the underlying NodeManagers
– NodeManagers:
• Take instructions from the ResourceManager
• Monitor container resource usage (CPU, memory, disk)
• Report resource status to the ResourceManager/Scheduler
[Diagram: Yarn ResourceManager (dc1-kdp-prod-hadoop-02) above NodeManager 1 (ex: dc1-kdp-prod-hadoop-06), NodeManager 2, NodeManager 3 … NodeManager n]
Spark on Yarn :: Spark application lifecycle
[Diagram: CLIENT; Yarn ResourceManager (dc1-kdp-prod-hadoop-02), which starts the AM; a YARN container running the Spark Application Master; other YARN containers each running a Spark Executor that executes Spark Tasks]
Spark on Yarn ::: Spark application lifecycle
• Key concepts:
– Application: may be a single job or a sequence of jobs. KDP Spark applications are mainly launched via Azkaban
• sparkAppRunner is a component that allows running a sparkApp on Yarn at Kelkoo
– Application Master:
• One per application, negotiates resources with YARN
• Runs inside a container
• Requests more hosts/containers to run the Spark application tasks
– Container @kelkoo => /d0/yarn/local/nm-local-dir/usercache/kookel/appcache/application_1441098196522_0213/container_1441098196522_0213_02_000006
– Spark Executor: a single JVM instance on a node that serves a single Spark application.
– Spark Task: a unit of work on a partition of a distributed dataset.
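The container directory name above encodes which application and attempt a container belongs to. The sketch below decodes it; parse_container_id is a hypothetical helper, and it only handles the plain container_<timestamp>_<app>_<attempt>_<container> form shown here, not the variants with an extra epoch field that later YARN versions can produce.

```python
import re

CONTAINER_RE = re.compile(r"container_(\d+)_(\d+)_(\d+)_(\d+)$")


def parse_container_id(container_id):
    """Split a YARN container ID into application ID, attempt and container number."""
    match = CONTAINER_RE.match(container_id)
    if match is None:
        raise ValueError(f"unexpected container ID format: {container_id}")
    cluster_ts, app_seq, attempt, container = match.groups()
    return {
        # The application ID reuses the first two numeric fields.
        "application_id": f"application_{cluster_ts}_{app_seq}",
        "attempt": int(attempt),
        "container": int(container),
    }


if __name__ == "__main__":
    print(parse_container_id("container_1441098196522_0213_02_000006"))
```

This is handy when going from a local container directory on a NodeManager back to the application it belongs to.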
Yarn Cluster focus
• The Yarn ResourceManager manages applications
• Yarn commands (installed on all servers running in the Yarn cluster)
– Can be useful to monitor applications from the ResourceManager; commands are invoked by the bin/yarn script
• yarn application -status application_1428487296152_99148
• yarn application -kill application_1428487296152_99148
• Yarn REST API (more or less like the yarn script, but more complete)
– XML output: curl --compressed -H "Accept: application/xml" -X GET http://hadoop-server:8088/ws/v1/cluster/apps/application_1428487296152_99610
– JSON output: curl --compressed -H "Accept: application/json" -X GET http://hadoop-server:8088/ws/v1/cluster/apps/application_1428487296152_99610
– Cluster metrics: http://dc1-kdp-prod-hadoop-02.prod.dc1.kelkoo.net:8088/ws/v1/cluster/metrics
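The JSON answer of the application endpoint can be inspected as sketched below. The sample response is trimmed and invented for illustration (a real answer carries many more fields); app_url and has_failed are hypothetical helper names. The state and finalStatus fields are part of the ResourceManager REST API.

```python
import json


def app_url(rm_host, app_id, port=8088):
    """Build the ResourceManager REST URL for one application."""
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps/{app_id}"


# Hypothetical, trimmed response body.
SAMPLE_RESPONSE = (
    '{"app": {"id": "application_1428487296152_99610", '
    '"state": "FINISHED", "finalStatus": "FAILED"}}'
)


def has_failed(response_json):
    """True when YARN or the application itself reports a failure or a kill."""
    app = json.loads(response_json)["app"]
    bad = ("FAILED", "KILLED")
    return app["state"] in bad or app["finalStatus"] in bad


if __name__ == "__main__":
    print(app_url("hadoop-server", "application_1428487296152_99610"))
    print(has_failed(SAMPLE_RESPONSE))
```

Checking both fields matters because an application can reach the FINISHED state while reporting a FAILED final status.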
HDFS & YARN :: don’t mix things up
YARN | HDFS
ResourceManager: allocates resources to applications in containers | NameNode: handles metadata, maps files to blocks and blocks to Datanodes
NodeManagers: manage containers | Datanodes
Container application data: /d0/yarn/local, /d1/yarn/local | Data directory: /d0, /d1
Kelkoo DataPlatform:: Accessing data
• Hive: data warehouse infrastructure
– Kelkoo => hiveMetastoreSchema
– System for managing and querying structured data, built on top of Hadoop
– Provides a simple query language called HiveQL, which is based on SQL
– Hive holds persistent data
[Diagram: Hive service, Hive Metastore DB, HDFS file: consistency_key_metrics]
Monitoring
• checking DataPlatform input and output
– INPUT : missing logs
– OUTPUT : missing or failed reports
• Monitoring hdfs:
– Check all datanodes are live (HDFS) and HA is running fine (Zookeeper etc..)
– Monitor capacity
• Flume monitoring
– Check flume is up
– Monitor flume Channel on Grafana
• Yarn
– Monitor failed spark applications
– Check all nodemanagers are live (YARN)
– Monitor allocated resources in containers
• Azkaban
– Monitor failed processes
• Monitoring tools
– Nagios:
– Grafana:
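A check such as "all NodeManagers are live" could be built on the cluster metrics endpoint mentioned earlier, as in this sketch. The sample payload and the thresholds are assumptions, and check_nodemanagers is a hypothetical name. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import json

OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical, trimmed answer of /ws/v1/cluster/metrics.
SAMPLE_METRICS = (
    '{"clusterMetrics": {"activeNodes": 9, "lostNodes": 1, '
    '"unhealthyNodes": 0, "appsFailed": 3}}'
)


def check_nodemanagers(metrics_json):
    """Return (exit code, message) in the style of a Nagios plugin."""
    metrics = json.loads(metrics_json)["clusterMetrics"]
    if metrics["lostNodes"] > 0:
        return CRITICAL, f"{metrics['lostNodes']} NodeManager(s) lost"
    if metrics["unhealthyNodes"] > 0:
        return WARNING, f"{metrics['unhealthyNodes']} NodeManager(s) unhealthy"
    return OK, f"{metrics['activeNodes']} NodeManagers active"


if __name__ == "__main__":
    code, message = check_nodemanagers(SAMPLE_METRICS)
    print(code, message)
```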
Wrap up: what you must remember
• Why Big data: to analyse large amounts of data
• FLUME:
– Aggregates and streams data into HDFS
– Transactional mode: client -> agent (source, channel, sink) -> HDFS storage
– Recoverable, reliable and scalable
• HDFS cluster: high performance distributed filesystem
– NameNode (master), Datanodes (data)
– HDFS High Availability with Zookeeper and JournalNodes
– HDFS files and blocks; blocks are replicated (3 by default)
• Spark on Yarn
– Yarn ResourceManager (master), NodeManagers (workers)
– Yarn is used to run Spark applications in distributed mode
• Hive metastore
– Turns HDFS files into structured data
Further reading
• Sources:
– The « Bible »: http://hadoop.apache.org/docs/current/
– http://fr.slideshare.net/martyhall/hadoop-tutorial-hdfs-part-1-overview
– https://www.facebook.com/BigDataHyderabad
– HA :
• http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1/
• http://hortonworks.com/blog/namenode-high-availability-in-hdp-2-0/
• http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/bi_install_large_cluster_qjm.html
– Spark & YARN:
• http://fr.slideshare.net/AdamKawa/apache-hadoop-yarn-simply-explained
• http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
• http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
– http://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/
– https://www.youtube.com/watch?v=nu16zSDw0BA
– HIVE: http://www.slideshare.net/athusoo/hive-apachecon-2008-presentation/
– FLUME:
• http://blog.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/
• http://www.ibm.com/developerworks/library/bd-flumews/
• http://www.slideshare.net/getindata/apache-flume-37675297
• Filechannel data: http://grokbase.com/t/flume/user/128qcby5nx/filechannel-data-directory-usage
– Code: https://www.codatlas.com/github.com/apache
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 

Setting up a Big Data Platform at Kelkoo

  • 1. Setting up a Big Data Platform at Kelkoo Data Platform Fabrice dos Santos 1st of Sep. 2015
  • 2. Kelkoo DataPlatform / Big Data ? • “Football is a simple game. Twenty-two men chase a ball for 90 minutes and at the end, the Germans always win” (Gary Lineker) • And why do they win? – Because they use big data! – The German team partnered with German software giant SAP AG to create a custom match analysis tool that collects and analyzes massive amounts of player performance data. • Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. • http://blogs.wsj.com/cio/2014/07/10/germanys-12th-man-at-the-world-cup-big-data/ • https://www.youtube.com/watch?v=JX5NLUViIMc • http://www.lesechos.fr/idees-debats/cercle/cercle-111048-le-monde-nouveau-du-big-data-1047390.php
  • 3. Kelkoo DataPlatform transitioning ::: AGENDA & Goal • Flume: data collection and aggregation • HDFS: distributed storage (Name node / Datanodes; HDFS INPUTS: LOGS; HDFS OUTPUT: REPORTS) • Spark on Yarn: distributed processing (ResourceManager / Nodemanager; Spark applications) • Hive / SparkSQL: query data, read and analyse • GOAL – give you the core concepts of the Hadoop platform @ Kelkoo – understand the dataflow – start getting used to the vocabulary
  • 4. 1/ Kelkoo DataPlatform :: Flume • Flume Agent: data transport • HDFS: data storage (Name node / Datanodes; HDFS INPUTS: LOGS; HDFS OUTPUT: REPORTS) • Spark on Yarn: data analysis and processing (ResourceManager / Nodemanager) • Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
  • 5. Flume :: Core concepts • [Diagram, FLUME AGENT (kelkoo_a1): External client (ECS, KLS etc…) pushes events -> Source (rImp, rLead) receives events -> Channel (cImp, cLead) -> Sink (sImp, sLead) polls the channel, reads events and forwards them -> HDFS] • Event: unit of data transported by Flume • [ Header (timestamp, hostname …) | Body (data) ] • Client (ECS, KLS): point of origin of events, which delivers them to a Flume agent • Flume agent (kelkooFlume_a?), a JVM process: • Source: consumes events and hands them over to the channel • Channel: buffers incoming events until a sink drains them for further transport => reliability • Sink: removes events from a channel and transmits them to the next agent (HDFS sink here) • The channel periodically writes a backup checkpoint out to disk => recoverability • HDFS storage: terminal repository • Checkpoint: /opt/kookel/data/kelkooFlume
  • 6. Flume @ Kelkoo • Source type – We use Avro, a data serialization format: compact and fast binary data format – Other source types: exec (tail -f /opt/… ) • Flume = transactional design = each event is treated as a transaction – Events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository, so a queue of current events is maintained until the storage confirmation is received • Distributed and scalable system: 4 agents at Kelkoo, installed on 2 servers, to spread the load • Channel type: file – The File Channel is Flume’s persistent channel. – Writes all events out to disk: no data loss on process or machine shutdown or crash. – The File Channel ensures that any events committed into the channel are removed from the channel only when a sink takes the events and commits the transaction, even if the machine or agent crashed and was restarted. – Designed to be highly concurrent and to handle several sources and sinks at the same time.
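The transactional hand-off described on this slide can be illustrated with a small toy model. This is only a sketch: `ToyChannel` and `drain` are hypothetical names, the channel lives in memory (unlike Flume's File Channel, which persists to disk), and none of this is Flume's actual API. It shows one idea only: an event taken by a sink leaves the channel for good only after storage is confirmed.

```python
from collections import deque


class ToyChannel:
    """Hypothetical in-memory stand-in for a Flume channel (not Flume's API)."""

    def __init__(self):
        self._events = deque()   # events waiting for a sink
        self._in_flight = []     # events taken but not yet committed

    def put(self, event):
        # A source hands an event over to the channel.
        self._events.append(event)

    def take(self):
        # A sink takes an event; it stays "in flight" until commit or rollback.
        event = self._events.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        # Storage was confirmed: the taken events can be dropped for good.
        self._in_flight.clear()

    def rollback(self):
        # Delivery failed: put the taken events back at the head of the queue.
        while self._in_flight:
            self._events.appendleft(self._in_flight.pop())

    def size(self):
        return len(self._events) + len(self._in_flight)


def drain(channel, store):
    """Sink step: take one event, try to store it, then commit or roll back."""
    event = channel.take()
    try:
        store(event)
    except IOError:
        channel.rollback()
        return False
    channel.commit()
    return True
```

If `store` raises `IOError` (for example because the terminal repository is unreachable), the event is back in the channel after `drain` returns, so a later attempt can deliver it.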
  • 7. Flume monitoring • Each agent exposes a JSON servlet – http://hadoop-server:34545/metrics – Returns JSON output that is easy to process for monitoring purposes, using a simple shell script with jq
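As an alternative to the shell + jq approach, the same check can be sketched in Python. The sample payload below is illustrative: the channel name `cImp` comes from the slides, the numbers are made up, and the key names (`Type`, `ChannelSize`, `ChannelCapacity`, values encoded as strings) are assumed to follow the layout of Flume's JSON reporting. A real script would fetch the body from the agent's `/metrics` URL instead of using a constant.

```python
import json

# Illustrative payload in the assumed shape of Flume's JSON metrics output.
SAMPLE = (
    '{"CHANNEL.cImp": {"Type": "CHANNEL", "ChannelSize": "150", '
    '"ChannelCapacity": "1000"}, "SOURCE.rImp": {"Type": "SOURCE"}}'
)


def channel_fill(metrics_json):
    """Return {component name: fill level in percent} for every channel."""
    metrics = json.loads(metrics_json)
    fill = {}
    for name, values in metrics.items():
        if values.get("Type") != "CHANNEL":
            continue  # skip sources and sinks
        size = float(values["ChannelSize"])
        capacity = float(values["ChannelCapacity"])
        fill[name] = 100.0 * size / capacity
    return fill


if __name__ == "__main__":
    for name, percent in channel_fill(SAMPLE).items():
        print(f"{name}: {percent:.1f}% full")
```

A channel that keeps filling up usually means the sink cannot drain events fast enough, which is why the fill level is a useful alert threshold.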
  • 8. 2/ Kelkoo DataPlatform :: HDFS distributed storage
  • 9. ::: HDFS definition • HDFS is a highly scalable, distributed file system, meant to store large amounts of data • Based on GoogleFS. • Appears as a single disk: it abstracts the physical architecture, so we can manipulate files as if we were on a single disk. • HDFS ? – HADOOP DISTRIBUTED FILESYSTEM
  • 10. HDFS Daemons Overview • 2 types of processes: – Namenode (must always be running): • Stores the metadata about files and blocks on the filesystem, manages namespaces – Maps a file name to a set of blocks – Maps a block to a set of Datanodes • Redirects clients to the appropriate Datanode for reads/writes – Datanodes: • Store the data in the local filesystem (ext4 at Kelkoo) • Periodically report the list of blocks they host to the Namenode and send it heartbeats • Serve data and metadata to clients • Run on several machines • [Diagram: NameNode dc1-kdp-prod-hadoop-00; Standby NameNode dc1-kdp-prod-hadoop-01; Datanode 1 (ex: dc1-kdp-prod-hadoop-06), Datanode 2, Datanode 3 … Datanode n]
  • 11. HDFS files and blocks example • The « toto.txt » file is managed by the Namenode and stored by the Datanodes – The file is split into blocks: Block #1 + Block #2 – When a file is read, the client asks the Namenode which Datanodes hold its blocks – Blocks are replicated (default is 3): ensures robustness and availability • [Diagram: Namenode dc1-kdp-prod-hadoop-00; Datanode 1, Datanode 2, Datanode 3 … Datanode n holding copies of B1 and B2: SAME BLOCK on multiple machines]
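The split-and-replicate idea can be made concrete with a little arithmetic. This is a sketch under stated assumptions: the 128 MiB block size is an assumption (the slides do not state the block size used at Kelkoo), the datanode names are placeholders, and the round-robin placement is a simplification, since real HDFS placement takes racks and free space into account.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB), not from the slides
REPLICATION = 3                 # default replication factor quoted in the slides


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block number -> list of datanodes.

    Replicas of one block land on distinct nodes as long as
    replication <= len(datanodes).
    """
    placement = {}
    for b in range(n_blocks):
        placement[b + 1] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    size = 200 * 1024 * 1024  # a 200 MiB "toto.txt"
    n = block_count(size)
    print(n, "blocks,", size * REPLICATION, "bytes of raw storage")
    print(place_replicas(n, ["dn1", "dn2", "dn3", "dn4"]))
```

With these assumptions a 200 MiB file needs two blocks, and losing any single datanode still leaves two copies of every block.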
  • 12. HDFS High availability with Hadoop 2+ • Namenodes: one active and one standby Namenode; the standby takes over if the active Namenode goes down (avoids a SPOF). • Zookeeper: high availability of the process • Zookeeper server: keeps a copy of the state of the entire system and persists this information in local log files. • ZooKeeper Failover Controller (ZKFC): monitors the NameNode and triggers a failover when the active NameNode is unavailable. • Quorum Journal Manager & JournalNodes: high availability of the data • Instead of storing HDFS edit logs in a single location (NFS), store them in several remote locations => the JournalNodes • Active Namenode: writes edits to the JournalNodes • QJM (a feature of the Namenode) ensures that we « reach the quorum », i.e. that the journal log is written to the majority of the JournalNodes • Standby Namenode: reads edits • On the server: conf written in • [Diagram: Zookeeper service (3 Zookeeper instances); Active NameNode dc1-kdp-prod-hadoop-00 with its ZKFC ("Monitor and maintain active lock"); Standby NameNode dc1-kdp-prod-hadoop-01 with its ZKFC ("Monitor and try to take act…"); shared edits on 3 JournalNodes: the active QJM writes, the standby QJM reads]
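"Reaching the quorum" boils down to a majority rule, sketched below. The function names are hypothetical and the code does not talk to any JournalNode; it only shows the arithmetic behind a majority write.

```python
def quorum_size(n_journal_nodes):
    """Smallest number of JournalNodes that forms a strict majority."""
    return n_journal_nodes // 2 + 1


def edit_is_durable(acks, n_journal_nodes):
    """An edit counts as written once a majority of JournalNodes acknowledged it."""
    return acks >= quorum_size(n_journal_nodes)


def tolerated_failures(n_journal_nodes):
    """How many JournalNodes can be down while writes still reach the quorum."""
    return n_journal_nodes - quorum_size(n_journal_nodes)


if __name__ == "__main__":
    n = 3  # three JournalNodes, as drawn on the slide
    print("quorum:", quorum_size(n))
    print("tolerated failures:", tolerated_failures(n))
```

With three JournalNodes the quorum is two, so the cluster keeps accepting edits with one JournalNode down. This is also why an odd number of nodes is the usual choice: a fourth node raises the quorum to three without tolerating any additional failure.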
  • 13. HDFS User Interface • Interacting with HDFS using filesystem shell commands (as kookel) – All commands are listed in the Hadoop documentation – hdfs dfs -<command> <options> – hdfs dfs -du -s -h /user/kookel/logs/flume/ • Commands for HDFS administration – hdfs dfsadmin -report -live | grep --color dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net -A1 Name: 10.76.99.60:50010 (dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net) Hostname: dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net Decommission Status : Normal – hdfs dfsadmin -getDatanodeInfo dc1-kdp-prod-hadoop-06.prod.dc1.kelkoo.net:50020 • Command for High Availability administration – hdfs haadmin -failover nn1 nn2 • Web interface: – http://dc1-kdp-prod-hadoop-00.prod.dc1.kelkoo.net:50070/dfsclusterhealth.html – http://dc1-kdp-prod-hadoop-00.prod.dc1.kelkoo.net:50070/dfshealth.html
  • 14. 3/ Spark on Yarn • HDFS = distributed storage • Spark = distributed processing • Apache Spark is an open-source data analytics cluster computing framework.[1] Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System
  • 15. Hadoop Yarn Cluster :: overview • YARN ? (Yet Another Resource Negotiator) • 2 types of processes: – ResourceManager • Arbitrates the available cluster resources – helps manage the distributed applications running on the YARN system – orchestrates the division of resources (compute, memory, bandwidth, etc.) among the underlying NodeManagers – NodeManagers: • Take instructions from the ResourceManager • Monitor container resource usage (CPU, memory, disk) • Report resource status to the ResourceManager/Scheduler • [Diagram: Yarn Resource Manager dc1-kdp-prod-hadoop-02; NodeManager 1 (ex: dc1-kdp-prod-hadoop-06), NodeManager 2, NodeManager 3 … NodeManager n]
  • 16. Spark on Yarn :: Spark application lifecycle • [Diagram: Client; Yarn Resource Manager (dc1-kdp-prod-hadoop-02), which starts the AM; a YARN container running the Spark Application Master; further YARN containers, each running a Spark Executor with its Spark Tasks]
  • 17. Spark on Yarn ::: Spark application lifecycle • Key concepts: – Application: may be a single job or a sequence of jobs; KDP Spark applications are mainly launched via Azkaban • sparkAppRunner is a component that allows running a sparkApp on Yarn at Kelkoo – Application Master: • one per application, negotiates resources with YARN • Runs inside a container • Requests more hosts/containers to run the Spark application tasks – Container @kelkoo => /d0/yarn/local/nm-local-dir/usercache/kookel/appcache/application_1441098196522_0213/container_1441098196522_0213_02_000006 – Spark Executor: a single JVM instance on a node that serves a single Spark application. – Spark Task: a unit of work on a partition of a distributed dataset.
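The container directory name on this slide encodes which application and attempt a container belongs to, which helps when tracing a local directory back to an application. The helper below is a sketch with a hypothetical function name. It assumes the five-part layout seen on the slide (`container_<cluster timestamp>_<application number>_<attempt>_<container number>`); identifiers with an extra epoch field, which some Hadoop versions produce, are rejected rather than guessed at.

```python
def parse_container_id(container_id):
    """Split a five-part YARN container id into its components."""
    parts = container_id.split("_")
    if len(parts) != 5 or parts[0] != "container":
        raise ValueError(f"unexpected container id layout: {container_id}")
    _, cluster_ts, app_number, attempt, container = parts
    return {
        # Matches the application_... directory one level up in the path.
        "application_id": f"application_{cluster_ts}_{app_number}",
        "attempt": int(attempt),
        "container": int(container),
    }


if __name__ == "__main__":
    print(parse_container_id("container_1441098196522_0213_02_000006"))
```

For the example above, the application id is the one to pass to `yarn application -status`.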
  • 18. Yarn Cluster focus • The Yarn Resource Manager manages applications • Yarn commands (installed on all servers running in the Yarn cluster) – Can be useful to monitor applications from the Resource Manager; commands are invoked by the bin/yarn script • yarn application -status application_1428487296152_99148 • yarn application -kill application_1428487296152_99148 • Yarn REST API (more or less like the yarn script, but more complete) – XML output: curl --compressed -H "Accept: application/xml" -X GET http://hadoop-server:8088/ws/v1/cluster/apps/application_1428487296152_99610 – JSON output: curl --compressed -H "Accept: application/json" -X GET http://hadoop-server:8088/ws/v1/cluster/apps/application_1428487296152_99610 – Cluster metrics: http://dc1-kdp-prod-hadoop-02.prod.dc1.kelkoo.net:8088/ws/v1/cluster/metrics
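The JSON output of the REST call above is easy to turn into a monitoring check. The sketch below builds the URL from the slide and interprets a response body. `hadoop-server` is the placeholder host used in the slides, the sample body is made up for illustration, and only the `state` and `finalStatus` fields of the `app` object are used. Fetching the body (for instance with `urllib.request`) is left out so the example runs without a cluster.

```python
import json

RM_URL = "http://hadoop-server:8088"  # placeholder host taken from the slides

# Made-up response body, reduced to the fields this sketch reads.
SAMPLE = (
    '{"app": {"id": "application_1428487296152_99610", '
    '"state": "FINISHED", "finalStatus": "FAILED"}}'
)


def app_url(app_id, base=RM_URL):
    """URL of the ResourceManager REST resource for one application."""
    return f"{base}/ws/v1/cluster/apps/{app_id}"


def has_failed(response_body):
    """True if the application was killed or ended with a failed status."""
    app = json.loads(response_body)["app"]
    bad = ("FAILED", "KILLED")
    return app["state"] in bad or app["finalStatus"] in bad


if __name__ == "__main__":
    print(app_url("application_1428487296152_99610"))
    print("failed:", has_failed(SAMPLE))
```

Checking both fields matters because an application can reach the `FINISHED` state while reporting a `FAILED` final status.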
  • 19. HDFS & YARN :: don’t mix things up • YARN: Resource Manager (resource allocation for applications in containers); NodeManagers (manage containers); container application data in /d0/yarn/local, /d1/yarn/local • HDFS: NameNode (handles metadata, maps files to blocks and blocks to datanodes); Datanodes; data directories /d0, /d1
  • 20. Kelkoo DataPlatform :: Accessing data • Hive: data warehouse infrastructure – Kelkoo => hiveMetastoreSchema – System for managing and querying structured data, built on top of Hadoop – Provides a simple query language called Hive QL, which is based on SQL – Hive holds persistent data • [Diagram: Hive service; Hive Metastore DB; HDFS file: consistency_key_metrics]
  • 21. Monitoring • Checking DataPlatform input and output – INPUT: missing logs – OUTPUT: missing or failed reports • Monitoring HDFS: – Check all datanodes are live (HDFS) and HA is running fine (Zookeeper etc.) – Monitor capacity • Flume monitoring – Check Flume is up – Monitor the Flume channel on Grafana • Yarn – Monitor failed Spark applications – Check all nodemanagers are live (YARN) – Monitor allocated resources in containers • Azkaban – Monitor failed processes • Monitoring tools – Nagios: – Grafana:
  • 22. Wrap up: what you must remember • Why Big data: to analyse large amounts of data • FLUME: – Aggregates and streams data into HDFS – Transactional mode: client -> agent (source, channel, sink) -> HDFS storage – Recoverable, reliable and scalable • HDFS cluster, high performance distributed filesystem – NameNode (master), Datanodes (data) – HDFS High Availability with Zookeeper and JournalNodes – HDFS files and blocks; blocks are replicated (3 by default) • Spark on Yarn – Yarn ResourceManager (master), NodeManagers (workers) – Yarn is used to run Spark applications in distributed mode • Hive metastore – Turns HDFS files into structured data
  • 23. Further reading • Sources: – The « Bible »: http://hadoop.apache.org/docs/current/ – http://fr.slideshare.net/martyhall/hadoop-tutorial-hdfs-part-1-overview – https://www.facebook.com/BigDataHyderabad – HA: • http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1/ • http://hortonworks.com/blog/namenode-high-availability-in-hdp-2-0/ • http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/bi_install_large_cluster_qjm.html – Spark & YARN: • http://fr.slideshare.net/AdamKawa/apache-hadoop-yarn-simply-explained • http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ • http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/ – http://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/ – https://www.youtube.com/watch?v=nu16zSDw0BA – HIVE: http://www.slideshare.net/athusoo/hive-apachecon-2008-presentation/ – FLUME: • http://blog.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/ • http://www.ibm.com/developerworks/library/bd-flumews/ • http://www.slideshare.net/getindata/apache-flume-37675297 • Filechannel data: http://grokbase.com/t/flume/user/128qcby5nx/filechannel-data-directory-usage – Code: https://www.codatlas.com/github.com/apache