Big Data processing using Hadoop
infrastructure
Use case
Intrum Justitia SDC
• 20 countries / different applications to process, store and analyze data
• Non-unified data storage formats
• High number of data objects (Records, Transactions, Entities)
• May have complex strong or loose relation rules
• Often: involving time-stamped events, made of incomplete data
Possible solutions
• Custom solution built from the ground up
• Semi-clustered approach
‒ Tools from Oracle
‒ MySQL/PostgreSQL nodes
‒ Document-oriented tools like MongoDB
• “Big Data” approach (Map-Reduce)
Map-Reduce
• Simple programming model that applies
to many large-scale computing problems
• Available in MongoDB for sharded data
• MapReduce tools usually offer:
‒ automatic parallelization
‒ load balancing
‒ network/disk transfer optimization
‒ handling of machine failures
‒ robustness
• Introduced by Google, open-source
implementation by Apache (Hadoop),
enterprise support by Cloudera
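To make the programming model concrete, here is a minimal in-memory sketch of a word count: a map step emits (word, 1) pairs, the pairs are grouped by key, and a reduce step sums each group. The class `MiniMapReduce` and its methods are hypothetical and use no Hadoop API; parallelization, load balancing and failure handling are exactly what a real framework adds on top.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the model; not Hadoop API code.
class MiniMapReduce {

    // MAP: turn one input record (a line of text) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // REDUCE: combine all values that were emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big cluster"))); // {big=2, cluster=1, data=1}
    }
}
```

Because `map` looks at one record at a time and `reduce` at one key at a time, both steps can be spread over many machines without changing the user code.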
CDH ecosystem
• Cloudera Manager
• YARN: resource manager
• HDFS
• MapReduce
• HBase: NoSQL
• Hive: HQL
• Sqoop: import / export
• Parquet
• Impala: SQL / ODBC
• Pig: Pig Latin
• ZooKeeper: coordination
• Hue
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
• Mountable (FUSE)
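HDFS file read
Code to data, not data to code
[Diagram: client application, HDFS client, NameNode and three DataNodes, with arrows numbered 1 to 4]
• NameNode entry for /bob/file.txt: Block A on DataNode 2 and DataNode 3; Block B on DataNode 1 and DataNode 3
• DataNode 1 holds blocks C, B, D
• DataNode 2 holds blocks C, A, D
• DataNode 3 holds blocks C, B, A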
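The read path on this slide can be sketched with plain maps: the NameNode only answers "which DataNodes hold which block", and the client then fetches each block from one replica. The class `HdfsReadSketch` is hypothetical and is not the HDFS client API. Choosing the first live replica is a simplifying assumption, since a real client takes other factors such as network distance into account.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the read path; not the HDFS client API.
class HdfsReadSketch {

    // NameNode metadata: file path -> ordered block ids, block id -> DataNodes holding a replica.
    static final Map<String, List<String>> FILE_BLOCKS =
            Map.of("/bob/file.txt", List.of("A", "B"));
    static final Map<String, List<String>> BLOCK_LOCATIONS = Map.of(
            "A", List.of("DataNode 2", "DataNode 3"),
            "B", List.of("DataNode 1", "DataNode 3"));

    // The client asks the NameNode for locations only, then picks one live replica per block.
    static Map<String, String> planRead(String path, List<String> deadNodes) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (String block : FILE_BLOCKS.get(path)) {
            String chosen = null;
            for (String node : BLOCK_LOCATIONS.get(block)) {
                if (!deadNodes.contains(node)) {
                    chosen = node;
                    break;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("no live replica for block " + block);
            }
            plan.put(block, chosen);
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(planRead("/bob/file.txt", List.of()));             // {A=DataNode 2, B=DataNode 1}
        System.out.println(planRead("/bob/file.txt", List.of("DataNode 2"))); // {A=DataNode 3, B=DataNode 1}
    }
}
```

The second call shows the redundancy at work: with DataNode 2 unavailable, block A is still readable from DataNode 3.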
Map-Reduce workflow and redundancy
[Diagram: Input files → MAP phase → Intermediate files → REDUCE phase → Output files]
• Input files: Split 0 to Split 4
• (1) The user program forks the master and the workers
• (2) The master assigns map tasks and reduce tasks to workers
• (3) Map workers read their input splits
• (4) Map workers write intermediate files to local disk
• (6) Reduce workers write Output File 0 and Output File 1
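Between the MAP and REDUCE phases, every intermediate key has to be routed to exactly one reduce task, so that all values for a key end up in the same output file. A common way to do this is hash partitioning; the sketch below follows the formula used by Hadoop's default hash partitioner as far as I know it. The class name and the sample keys are made up for illustration.

```java
// Hypothetical illustration of routing intermediate keys to reduce tasks.
class PartitionSketch {

    // The same key always maps to the same reducer; masking the sign bit keeps the index non-negative.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Two reduce tasks, matching the two output files in the diagram.
        for (String key : new String[] {"LV", "SE", "FI", "LV"}) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```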
Hive
• High-level data access language
• Data Warehouse System for Hadoop
• Data Aggregation
• Ad-Hoc Queries
• SQL-like Language (HiveQL)
• SQL → Hive → MapReduce
Pig
• High-level data access language (zero Java knowledge required)
• Data preparation (ETL)
• Pig Latin scripting language
• Pig Latin → Pig → MapReduce
HiveQL vs. Pig Latin
HiveQL:
insert into ValClickPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
Pig Latin:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Pig Latin is procedural, whereas HiveQL is declarative.
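For readers who think in Java, the same pipeline can be written step by step over in-memory collections, which is close in spirit to the procedural Pig Latin version. This is only a sketch: the class `ValuableClicksPerDma` is hypothetical, rows are plain `String` arrays, the `age` and `url` columns are left out, and it assumes that user names and IP addresses in `geoinfo` are unique so each join is a simple lookup.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of the query; rows are String arrays for brevity.
class ValuableClicksPerDma {

    // users: {name, ipaddr}; clicks: {user, value}; geoinfo: {ipaddr, dma}
    static Map<String, Integer> count(List<String[]> users, List<String[]> clicks,
                                      List<String[]> geoinfo) {
        Map<String, String> ipByUser = new HashMap<>();
        for (String[] u : users) {
            ipByUser.put(u[0], u[1]);
        }
        Map<String, String> dmaByIp = new HashMap<>();
        for (String[] g : geoinfo) {
            dmaByIp.put(g[0], g[1]);
        }
        Map<String, Integer> perDma = new TreeMap<>();
        for (String[] c : clicks) {
            if (Double.parseDouble(c[1]) <= 0) continue; // filter Clicks by value > 0
            String ip = ipByUser.get(c[0]);              // join with Users by name
            if (ip == null) continue;
            String dma = dmaByIp.get(ip);                // join with Geoinfo by ipaddr
            if (dma == null) continue;
            perDma.merge(dma, 1, Integer::sum);          // group by dma and count
        }
        return perDma;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
                new String[] {"alice", "10.0.0.1"}, new String[] {"bob", "10.0.0.2"});
        List<String[]> clicks = List.of(
                new String[] {"alice", "1.5"}, new String[] {"alice", "0"},
                new String[] {"bob", "2.0"}, new String[] {"carol", "3.0"});
        List<String[]> geoinfo = List.of(
                new String[] {"10.0.0.1", "501"}, new String[] {"10.0.0.2", "602"});
        System.out.println(count(users, clicks, geoinfo)); // {501=1, 602=1}
    }
}
```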
Impala
• Real-time queries, ~100x faster compared to Hive
• Direct data access
• Query data on HDFS or HBase
• Allows table joins and aggregation
• Access path: ODBC → Impala → HDFS / HBase
• Row Groups: A group of rows in columnar format
‒ One (or more) per split while reading
‒ Max size buffered in memory while writing
‒ About 50MB < row group < 1GB
• Column chunk: Data for one column in a row group
‒ Column chunks can be read independently for efficient scans
• Page: Unit of access in a column chunk
‒ Should be big enough for efficient compression
‒ Min size to read while accessing a single record
‒ About 8KB < page < 1MB
Lars George, Cloudera. Data I/O, 2013
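The row group and column chunk idea can be illustrated without the Parquet library: rows are cut into groups, and inside each group the values are stored column by column. The sketch below is hypothetical and heavily simplified. It sizes row groups by row count (real Parquet works with byte sizes) and leaves out pages, encodings and compression.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory layout; not the Parquet API or file format.
class RowGroupSketch {

    // Returns row groups -> column chunks -> values: result.get(g).get(c) is one column chunk.
    static List<List<List<Object>>> toRowGroups(List<Object[]> rows, int columns, int rowsPerGroup) {
        List<List<List<Object>>> groups = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += rowsPerGroup) {
            int end = Math.min(start + rowsPerGroup, rows.size());
            List<List<Object>> chunks = new ArrayList<>();
            for (int c = 0; c < columns; c++) {
                List<Object> chunk = new ArrayList<>();
                for (int r = start; r < end; r++) {
                    chunk.add(rows.get(r)[c]);
                }
                chunks.add(chunk);
            }
            groups.add(chunks);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Object[]> rows = List.of(
                new Object[] {1, "open"}, new Object[] {2, "paid"}, new Object[] {3, "open"});
        // Scanning only column 1 touches one chunk per row group and never reads column 0.
        for (List<List<Object>> group : toRowGroups(rows, 2, 2)) {
            System.out.println(group.get(1));
        }
    }
}
```

Reading a single column this way is what makes column chunks efficient for scans that need only a few fields.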
HBase
Column-oriented data storage. Very large tables – billions
of rows X millions of columns.
• Low Latency
• Random Reads And Writes (by PK)
• Distributed Key/Value Store; automatic region sharding
• Simple API
‒ PUT
‒ GET
‒ DELETE
‒ SCAN
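The four operations map naturally onto a sorted key/value structure. The sketch below uses a `TreeMap` as a stand-in for one table with a single column; `SortedKeyValueSketch` is a hypothetical class, not the HBase client API. It compares keys as Java strings, which matches HBase's byte-wise ordering only for plain ASCII keys.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for one table with a single column; not the HBase client API.
class SortedKeyValueSketch {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    String get(String rowKey) {
        return rows.get(rowKey);
    }

    void delete(String rowKey) {
        rows.remove(rowKey);
    }

    // SCAN: all rows with startRow <= key < stopRow, returned in key order.
    NavigableMap<String, String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        SortedKeyValueSketch table = new SortedKeyValueSketch();
        table.put("case-001", "open");
        table.put("case-002", "closed");
        table.put("case-010", "paid");
        table.put("debt-001", "open");
        table.delete("case-002");
        System.out.println(table.get("case-010"));              // paid
        System.out.println(table.scan("case-000", "case-100")); // {case-001=open, case-010=paid}
    }
}
```

Because the keys are kept sorted, a scan is a cheap range read, which is why row key design matters so much in HBase.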
HBase building blocks
• The most basic unit in HBase is a column
‒ Each column may have multiple versions, with each distinct value contained in a separate cell
‒ One or more columns form a row, which is addressed uniquely by a row key
‒ Can have millions of columns
‒ Can be compressed or tagged to stay in memory
• A table is a collection of rows
‒ All rows are always sorted lexicographically by their row key
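One way to picture these building blocks is as nested sorted maps: row key → column → timestamp → value. The sketch below is a hypothetical model, not HBase code. It shows two consequences of the layout: row keys sort lexicographically (so "row10" comes before "row2"), and a cell keeps several versions, with the newest one returned first.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical nested-map view of a table: row key -> column -> timestamp -> value.
class CellVersionsSketch {
    final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             // Versions are kept newest first, hence the reversed timestamp order.
             .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // Newest version of a cell, or null if the row or column is absent.
    String latest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = table.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellVersionsSketch t = new CellVersionsSketch();
        t.put("row2", "cf:status", 1L, "open");
        t.put("row10", "cf:status", 1L, "open");
        t.put("row10", "cf:status", 2L, "paid");
        System.out.println(t.table.keySet());             // [row10, row2]
        System.out.println(t.latest("row10", "cf:status")); // paid
    }
}
```

The `[row10, row2]` ordering is the usual reason for zero-padding numeric parts of row keys.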
HBase read
[Diagram: Client, ZooKeeper, HMaster and three RegionServers, running on top of Hadoop]
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs.
• Workflow is a collection of actions
• Arranged in Directed Acyclic Graph
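Arranging actions in a directed acyclic graph means an action may start only after everything it depends on has finished. The sketch below computes one valid execution order with a standard topological sort. `WorkflowDagSketch` and the action names are hypothetical; Oozie workflows are declared in workflow definition files rather than through an API like this. The sketch assumes every action appears as a key in the input map, including actions with no dependencies.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical scheduler sketch; not how Oozie is configured or implemented.
class WorkflowDagSketch {

    // dependsOn: action -> actions that must finish first. Returns one valid execution order.
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new TreeMap<>();        // unfinished dependencies per action
        Map<String, List<String>> dependents = new HashMap<>(); // reverse edges
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String dep : e.getValue()) {
                dependents.computeIfAbsent(dep, d -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.poll();
            order.add(action);
            for (String next : dependents.getOrDefault(action, List.of())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalArgumentException("workflow contains a cycle or an undeclared action");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> workflow = new HashMap<>();
        workflow.put("import", List.of());
        workflow.put("parse", List.of("import"));
        workflow.put("quality", List.of("parse"));
        workflow.put("analytics", List.of("parse"));
        workflow.put("export", List.of("quality", "analytics"));
        System.out.println(executionOrder(workflow));
    }
}
```

Actions with no ordering between them, such as `quality` and `analytics` here, may come out in either order, and a scheduler is free to run them in parallel.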
[Diagram: a job is submitted to the Oozie server, which schedules MapReduce, Pig, ... actions and collects their results; a failed action is re-scheduled; at the end the server reports "All done"]
Hadoop v1 architecture
• JobTracker
‒ Manage Cluster Resources
‒ Job Scheduling
• TaskTracker
‒ Per-node agent
‒ Task management
• Single purpose – batch processing
Hadoop v2 - YARN architecture
• ResourceManager – Allocates cluster resources
• NodeManager – Enforces node resource allocations
• ApplicationMaster – Application lifecycle and task scheduler
• Multi-purpose system: batch, interactive querying, streaming, aggregation
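The split of responsibilities can be pictured with a toy model: the ResourceManager tracks what each NodeManager has free and answers container requests coming from an application master. Everything in the sketch below is hypothetical: the class `YarnAllocationSketch`, the node names and the first-fit policy. Real YARN schedulers are considerably more elaborate, and this does not use the YARN API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: a ResourceManager that tracks free memory per NodeManager.
class YarnAllocationSketch {
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    // A NodeManager reports its capacity when it joins the cluster.
    void registerNode(String nodeManager, int memoryMb) {
        freeMemoryMb.put(nodeManager, memoryMb);
    }

    // Grants a container on the first node with enough free memory; returns null if none fits.
    String allocate(int requestMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= requestMb) {
                node.setValue(node.getValue() - requestMb);
                return node.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        YarnAllocationSketch rm = new YarnAllocationSketch();
        rm.registerNode("node1", 4096);
        rm.registerNode("node2", 8192);
        System.out.println(rm.allocate(3072)); // node1
        System.out.println(rm.allocate(2048)); // node2 (node1 has only 1024 MB left)
        System.out.println(rm.allocate(8192)); // null (no node has 8192 MB free)
    }
}
```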
[Diagram: the client submits a job to the ResourceManager; NodeManagers host containers and the application master (App Mstr); arrows show MapReduce status, job submission, node status and resource requests]
Hadoop infrastructure integration
[Diagram: data flow through the Hadoop cluster]
• Data in: TxB and IW extract programs, delivered over (S)FTP, SCP, HTTP(S), JDBC
• Storage: HDFS RAW → HDFS Binary → HDFS Results
• Processing steps: Parsing & Validation, Conversion & Compression, Data Quality Analysis, Business Analytics, Data Transformation, Data Delivery
• Data out: Intrum Web, PAM, GSS, Catalyst, Dashboard
• Monitoring and Management
Development environment integration
Generic enterprise stack
• Maven
• Spring, Spring-Hadoop
• Hibernate
• H2, MySQL cluster
• LDAP, Kerberos
• CI (Jenkins ...)
Java example: Hadoop task
@Component
public class HBaseEventLogToMySQL extends Configured implements Tool {
    @Autowired private EntityManagerFactory entityManagerFactory;

    // getLastMySQLEvent(), readRowsToMySQL() and tableName are not shown on the slide.
    @Override
    public int run(String[] args) throws Exception {
        // Resume after the last event already copied to MySQL, or scan from the start.
        LogAbstractEvent lastEvent = getLastMySQLEvent();
        Scan scan;
        String lastEventKey = "";
        if (lastEvent == null) {
            scan = new Scan();
        } else {
            lastEventKey = lastEvent.getEventKey();
            // The start row sorts after lastEventKey, so that row is not read again.
            scan = new Scan(Bytes.toBytes(lastEventKey + Character.MAX_VALUE));
        }
        final Configuration conf = HBaseConfiguration.create(getConf());
        HTable table = new HTable(conf, tableName);
        ResultScanner resultScanner = table.getScanner(scan);
        readRowsToMySQL(resultScanner);
        return 0; // exit code required by Tool.run
    }
}
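The start row `lastEventKey + Character.MAX_VALUE` works because HBase orders row keys as unsigned bytes, and the appended character encodes to high byte values in UTF-8. The sketch below checks this with plain JDK calls. `ResumeKeySketch` and the sample keys are hypothetical, and the comparison only imitates the ordering rather than calling HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: compares UTF-8 encoded keys as unsigned bytes, imitating row key order.
class ResumeKeySketch {

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True when a row with this key sorts at or after the start row lastEventKey + Character.MAX_VALUE.
    static boolean isScanned(String rowKey, String lastEventKey) {
        byte[] startRow = utf8(lastEventKey + Character.MAX_VALUE);
        return Arrays.compareUnsigned(utf8(rowKey), startRow) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isScanned("evt-0005", "evt-0005"));   // false: already copied
        System.out.println(isScanned("evt-0006", "evt-0005"));   // true: next event
        System.out.println(isScanned("evt-0005-a", "evt-0005")); // false: shares the prefix
    }
}
```

The last case is worth noting: keys that merely extend the last key as a prefix are skipped too, so the trick relies on event keys not being prefixes of one another.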
Java example: Map part (Table)
public class BasicProdStatusHbaseMapper extends TableMapper<Text, MapWritableComparable> {
    @Override
    public void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // Decode the serialized case column of the current row into a map.
        byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
        // Map<String, Object> is an assumed type; the slide shows only Map<>.
        Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);
        MapWritableComparable map = new MapWritableComparable();
        map.put(new Text("originalCapital"),
                new DoubleWritable((Double) caseMap.get("OriginalCapital")));
        map.put(new Text("remainingCapital"),
                new DoubleWritable((Double) caseMap.get("RemainingCapital")));
        context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
        // mainStatusCode is not defined on the slide.
        context.write(new Text(mainStatusCode), map);
    }
}
Java example: Reduce part
public class BasicProdStatusHbaseReducer
        extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritableComparable> values, Context context)
            throws IOException, InterruptedException {
        String mainStatusCode = key.toString();
        // Accumulate the capital figures of all cases that share this status code.
        AggregationBean ab = new AggregationBean();
        for (MapWritableComparable map : values) {
            double originalCapital =
                    ((DoubleWritable) map.get(new Text("originalCapital"))).get();
            double remainingCapital =
                    ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
            ab.add(originalCapital, remainingCapital);
        }
        // pid is not defined on the slide.
        context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
        context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
    }
}
Q&A
We are hiring!
Big Data processing using Hadoop
infrastructure

More Related Content

What's hot

Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core conceptsMaryan Faryna
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata Mk Kim
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUBAhmed Salman
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 

What's hot (20)

Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 

Viewers also liked

Boston HUG - Cloudera presentation
Boston HUG - Cloudera presentationBoston HUG - Cloudera presentation
Boston HUG - Cloudera presentationreedshea
 
презентация савилова
презентация савиловапрезентация савилова
презентация савиловаdavidovanat
 
Reverse engineering of binary programs for custom virtual machines
Reverse engineering of binary programs for custom virtual machinesReverse engineering of binary programs for custom virtual machines
Reverse engineering of binary programs for custom virtual machinesSmartDec
 
Лекция 11 Действие электрического тока на биологические ткани организма
Лекция 11 Действие электрического тока на биологические ткани организмаЛекция 11 Действие электрического тока на биологические ткани организма
Лекция 11 Действие электрического тока на биологические ткани организмаdrtanton
 
доклад электромагнитное излучение
доклад электромагнитное излучениедоклад электромагнитное излучение
доклад электромагнитное излучениеdavidovanat
 
Бинарный анализ с декомпиляцией и LLVM
Бинарный анализ с декомпиляцией и LLVMБинарный анализ с декомпиляцией и LLVM
Бинарный анализ с декомпиляцией и LLVMSmartDec
 
влияние электромагнитного излучения бытовых приборов и сото
влияние электромагнитного излучения бытовых приборов и сотовлияние электромагнитного излучения бытовых приборов и сото
влияние электромагнитного излучения бытовых приборов и сотоAndrei V, Zhuravlev
 
электромагнитное излучение и его влияние на человека
электромагнитное излучение и его влияние на человекаэлектромагнитное излучение и его влияние на человека
электромагнитное излучение и его влияние на человекаAndrei V, Zhuravlev
 
презентация
презентацияпрезентация
презентацияAndrey Fomenko
 
Биологическое действие магнитного поля на организм человека
Биологическое действие магнитного поля на организм человека Биологическое действие магнитного поля на организм человека
Биологическое действие магнитного поля на организм человека amtc7
 
Негативное воздействие компьютера на здоровье человека и способы защиты
Негативное воздействие компьютера на здоровье человека и способы защитыНегативное воздействие компьютера на здоровье человека и способы защиты
Негативное воздействие компьютера на здоровье человека и способы защитыHakimova_AR
 
Системноинженерное мышление в непрерывном образовании
Системноинженерное мышление в непрерывном образованииСистемноинженерное мышление в непрерывном образовании
Системноинженерное мышление в непрерывном образованииAnatoly Levenchuk
 
влияние компьютера на человека
влияние компьютера на человекавлияние компьютера на человека
влияние компьютера на человекаZavirukhina
 
низкоуровневое программирование сегодня новые стандарты с++, программирован...
низкоуровневое программирование сегодня   новые стандарты с++, программирован...низкоуровневое программирование сегодня   новые стандарты с++, программирован...
низкоуровневое программирование сегодня новые стандарты с++, программирован...COMAQA.BY
 
А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0
А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0
А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0COMAQA.BY
 
В топку Postman - пишем API автотесты в привычном стеке
В топку Postman - пишем API автотесты в привычном стекеВ топку Postman - пишем API автотесты в привычном стеке
В топку Postman - пишем API автотесты в привычном стекеCOMAQA.BY
 
Автоматизация тестирования API для начинающих
Автоматизация тестирования API для начинающихАвтоматизация тестирования API для начинающих
Автоматизация тестирования API для начинающихCOMAQA.BY
 

Viewers also liked (20)

JOOQ and Flyway
JOOQ and FlywayJOOQ and Flyway
JOOQ and Flyway
 
Boston HUG - Cloudera presentation
Boston HUG - Cloudera presentationBoston HUG - Cloudera presentation
Boston HUG - Cloudera presentation
 
презентация савилова
презентация савиловапрезентация савилова
презентация савилова
 
Reverse engineering of binary programs for custom virtual machines
Reverse engineering of binary programs for custom virtual machinesReverse engineering of binary programs for custom virtual machines
Reverse engineering of binary programs for custom virtual machines
 
Лекция 11 Действие электрического тока на биологические ткани организма
Лекция 11 Действие электрического тока на биологические ткани организмаЛекция 11 Действие электрического тока на биологические ткани организма
Лекция 11 Действие электрического тока на биологические ткани организма
 
доклад электромагнитное излучение
доклад электромагнитное излучениедоклад электромагнитное излучение
доклад электромагнитное излучение
 
Бинарный анализ с декомпиляцией и LLVM
Бинарный анализ с декомпиляцией и LLVMБинарный анализ с декомпиляцией и LLVM
Бинарный анализ с декомпиляцией и LLVM
 
влияние электромагнитного излучения бытовых приборов и сото
влияние электромагнитного излучения бытовых приборов и сотовлияние электромагнитного излучения бытовых приборов и сото
влияние электромагнитного излучения бытовых приборов и сото
 
C++ idioms
C++ idiomsC++ idioms
C++ idioms
 
электромагнитное излучение и его влияние на человека
электромагнитное излучение и его влияние на человекаэлектромагнитное излучение и его влияние на человека
электромагнитное излучение и его влияние на человека
 
Java Memory Model
Java Memory ModelJava Memory Model
Java Memory Model
 
презентация
презентацияпрезентация
презентация
 
Биологическое действие магнитного поля на организм человека
Биологическое действие магнитного поля на организм человека Биологическое действие магнитного поля на организм человека
Биологическое действие магнитного поля на организм человека
 
Негативное воздействие компьютера на здоровье человека и способы защиты
Негативное воздействие компьютера на здоровье человека и способы защитыНегативное воздействие компьютера на здоровье человека и способы защиты
Негативное воздействие компьютера на здоровье человека и способы защиты
 
Системноинженерное мышление в непрерывном образовании
Системноинженерное мышление в непрерывном образованииСистемноинженерное мышление в непрерывном образовании
Системноинженерное мышление в непрерывном образовании
 
влияние компьютера на человека
влияние компьютера на человекавлияние компьютера на человека
влияние компьютера на человека
 
низкоуровневое программирование сегодня новые стандарты с++, программирован...
низкоуровневое программирование сегодня   новые стандарты с++, программирован...низкоуровневое программирование сегодня   новые стандарты с++, программирован...
низкоуровневое программирование сегодня новые стандарты с++, программирован...
 
А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0
А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0
А давайте будем многопоточить и масштабировить! - записки сумасшедшего №0
 
В топку Postman - пишем API автотесты в привычном стеке
В топку Postman - пишем API автотесты в привычном стекеВ топку Postman - пишем API автотесты в привычном стеке
В топку Postman - пишем API автотесты в привычном стеке
 
Автоматизация тестирования API для начинающих
Автоматизация тестирования API для начинающихАвтоматизация тестирования API для начинающих
Автоматизация тестирования API для начинающих
 

Similar to Big Data Processing Using Hadoop Infrastructure

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!MongoDB
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopMongoDB
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 

Similar to Big Data Processing Using Hadoop Infrastructure (20)

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop
HadoopHadoop
Hadoop
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!What's the Scoop on Hadoop? How It Works and How to WORK IT!
What's the Scoop on Hadoop? How It Works and How to WORK IT!
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Hadoop
HadoopHadoop
Hadoop
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + Hadoop
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
מיכאל
מיכאלמיכאל
מיכאל
 

More from Dmitry Buzdin

How Payment Cards Really Work?
How Payment Cards Really Work?How Payment Cards Really Work?
How Payment Cards Really Work?Dmitry Buzdin
 
Как построить свой фреймворк для автотестов?
Как построить свой фреймворк для автотестов?Как построить свой фреймворк для автотестов?
Как построить свой фреймворк для автотестов?Dmitry Buzdin
 
How to grow your own Microservice?
How to grow your own Microservice?How to grow your own Microservice?
How to grow your own Microservice?Dmitry Buzdin
 
How to Build Your Own Test Automation Framework?
How to Build Your Own Test Automation Framework?How to Build Your Own Test Automation Framework?
How to Build Your Own Test Automation Framework?Dmitry Buzdin
 
Delivery Pipeline for Windows Machines
Delivery Pipeline for Windows MachinesDelivery Pipeline for Windows Machines
Delivery Pipeline for Windows MachinesDmitry Buzdin
 
Developing Useful APIs
Developing Useful APIsDeveloping Useful APIs
Developing Useful APIsDmitry Buzdin
 
Архитектура Ленты на Одноклассниках
Архитектура Ленты на ОдноклассникахАрхитектура Ленты на Одноклассниках
Архитектура Ленты на ОдноклассникахDmitry Buzdin
 
Riding Redis @ask.fm
Riding Redis @ask.fmRiding Redis @ask.fm
Riding Redis @ask.fmDmitry Buzdin
 
Rubylight JUG Contest Results Part II
Rubylight JUG Contest Results Part IIRubylight JUG Contest Results Part II
Rubylight JUG Contest Results Part IIDmitry Buzdin
 
Rubylight Pattern-Matching Solutions
Rubylight Pattern-Matching SolutionsRubylight Pattern-Matching Solutions
Rubylight Pattern-Matching SolutionsDmitry Buzdin
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
Poor Man's Functional Programming
Poor Man's Functional ProgrammingPoor Man's Functional Programming
Poor Man's Functional ProgrammingDmitry Buzdin
 
Rubylight programming contest
Rubylight programming contestRubylight programming contest
Rubylight programming contestDmitry Buzdin
 
Continuous Delivery
Continuous Delivery Continuous Delivery
Continuous Delivery Dmitry Buzdin
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOpsDmitry Buzdin
 
Thread Dump Analysis
Thread Dump AnalysisThread Dump Analysis
Thread Dump AnalysisDmitry Buzdin
 
Pragmatic Java Test Automation
Pragmatic Java Test AutomationPragmatic Java Test Automation
Pragmatic Java Test AutomationDmitry Buzdin
 

More from Dmitry Buzdin (20)

How Payment Cards Really Work?
How Payment Cards Really Work?How Payment Cards Really Work?
How Payment Cards Really Work?
 
Как построить свой фреймворк для автотестов?
Как построить свой фреймворк для автотестов?Как построить свой фреймворк для автотестов?
Как построить свой фреймворк для автотестов?
 
How to grow your own Microservice?
How to grow your own Microservice?How to grow your own Microservice?
How to grow your own Microservice?
 
How to Build Your Own Test Automation Framework?
How to Build Your Own Test Automation Framework?How to Build Your Own Test Automation Framework?
How to Build Your Own Test Automation Framework?
 
Delivery Pipeline for Windows Machines
Delivery Pipeline for Windows MachinesDelivery Pipeline for Windows Machines
Delivery Pipeline for Windows Machines
 
Developing Useful APIs
Developing Useful APIsDeveloping Useful APIs
Developing Useful APIs
 
Whats New in Java 8
Whats New in Java 8Whats New in Java 8
Whats New in Java 8
 
Архитектура Ленты на Одноклассниках
Архитектура Ленты на ОдноклассникахАрхитектура Ленты на Одноклассниках
Архитектура Ленты на Одноклассниках
 
Dart Workshop
Dart WorkshopDart Workshop
Dart Workshop
 
Riding Redis @ask.fm
Riding Redis @ask.fmRiding Redis @ask.fm
Riding Redis @ask.fm
 
Rubylight JUG Contest Results Part II
Rubylight JUG Contest Results Part IIRubylight JUG Contest Results Part II
Rubylight JUG Contest Results Part II
 
Rubylight Pattern-Matching Solutions
Rubylight Pattern-Matching SolutionsRubylight Pattern-Matching Solutions
Rubylight Pattern-Matching Solutions
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Poor Man's Functional Programming
Poor Man's Functional ProgrammingPoor Man's Functional Programming
Poor Man's Functional Programming
 
Rubylight programming contest
Rubylight programming contestRubylight programming contest
Rubylight programming contest
 
Continuous Delivery
Continuous Delivery Continuous Delivery
Continuous Delivery
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
 
Thread Dump Analysis
Thread Dump AnalysisThread Dump Analysis
Thread Dump Analysis
 
Pragmatic Java Test Automation
Pragmatic Java Test AutomationPragmatic Java Test Automation
Pragmatic Java Test Automation
 
Mlocjs buzdin
Mlocjs buzdinMlocjs buzdin
Mlocjs buzdin
 

Big Data Processing Using Hadoop Infrastructure

  • 1. Big Data processing using Hadoop infrastructure
  • 2. Use case: Intrum Justitia SDC
      • 20 countries / different applications to process, store and analyze data
      • Non-unified data storage formats
      • High number of data objects (records, transactions, entities)
      • May have complex strong or loose relation rules
      • Often involving time-stamped events made of incomplete data
  • 3. Possible solutions
      • Custom solution built from the ground up
      • Semi-clustered approach
          ‒ Tools from Oracle
          ‒ MySQL/PostgreSQL nodes
          ‒ Document-oriented tools like MongoDB
      • "Big Data" approach (Map-Reduce)
  • 4. Map-Reduce
      • Simple programming model that applies to many large-scale computing problems
      • Available in MongoDB for sharded data
      • MapReduce tools usually offer:
          ‒ automatic parallelization
          ‒ load balancing
          ‒ network/disk transfer optimization
          ‒ handling of machine failures
          ‒ robustness
      • Introduced by Google, open-source implementation by Apache (Hadoop), enterprise support by Cloudera
  • 5. CDH ecosystem
      [Diagram of components: Cloudera Manager; YARN (resource manager); HDFS; MapReduce; HBase (NoSQL); Hive (HQL); Sqoop (import/export); Parquet; Impala (SQL / ODBC); Pig (Pig Latin); ZooKeeper (coordination); Hue]
  • 6. HDFS
      • Hadoop Distributed File System
      • Redundancy
      • Fault tolerant
      • Scalable
      • Self-healing
      • Write once, read many times
      • Java API
      • Command-line tool
      • Mountable (FUSE)
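  • 7. HDFS file read
      Code to data, not data to code.
      [Diagram, steps numbered 1 to 4: a client application uses the HDFS client to look up /bob/file.txt on the name node, which lists Block A on DataNode 2 and DataNode 3 and Block B on DataNode 1 and DataNode 3. Block replicas shown: DataNode 1 holds C, B, D; DataNode 2 holds C, A, D; DataNode 3 holds C, B, A.]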
  • 8. Map-Reduce workflow and redundancy
      [Diagram: the user program (1) forks a master and several workers; the master (2) assigns map and reduce tasks; map workers (3) read input splits 0 to 4 and (4) write intermediate files locally; reduce workers (6) write output files 0 and 1. Stages, left to right: input files, MAP phase, intermediate files, REDUCE phase, output files.]
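The flow on the workflow slide (input splits, MAP phase, intermediate files, REDUCE phase, output) can be illustrated in plain Java. This is a minimal single-process sketch, not Hadoop code: the class `MapReduceSketch` and its methods are hypothetical names, word counting is an assumed example job, and the in-memory "shuffle" stands in for the intermediate files that real workers write and read.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from the Hadoop API.
public class MapReduceSketch {

    // MAP phase: turn one input split into intermediate (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group intermediate values by key (stands in for the intermediate files).
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // REDUCE phase: collapse all values of one key into a single result.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every split, shuffle, then reduce every key.
    public static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(map(split));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Two splits, as if the input file had been cut in two.
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

Because `map` sees one split at a time and `reduce` sees one key at a time, a framework can run each call on a different worker and simply re-run a call whose worker failed.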
  • 9. Hive and Pig
      • Hive
          ‒ High-level data access language
          ‒ Data warehouse system for Hadoop
          ‒ Data aggregation
          ‒ Ad-hoc queries
          ‒ SQL-like language (HiveQL)
      • Pig
          ‒ High-level data access language (zero Java knowledge required)
          ‒ Data preparation (ETL)
          ‒ Pig Latin scripting language
      [Diagram: SQL → Hive → MapReduce; Pig Latin → Pig → MapReduce]
  • 10. HiveQL vs. Pig Latin
      HiveQL:
          insert into ValClickPerDMA
          select dma, count(*)
          from geoinfo join (
              select name, ipaddr
              from users join clicks on (users.name = clicks.user)
              where value > 0
          ) using ipaddr
          group by dma;
      Pig Latin:
          Users = load 'users' as (name, age, ipaddr);
          Clicks = load 'clicks' as (user, url, value);
          ValuableClicks = filter Clicks by value > 0;
          UserClicks = join Users by name, ValuableClicks by user;
          Geoinfo = load 'geoinfo' as (ipaddr, dma);
          UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
          ByDMA = group UserGeo by dma;
          ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
          store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
      Pig Latin is procedural, whereas HQL is declarative.
  • 11. Impala
      • Real-time queries, ~100x faster than Hive
      • Direct data access
      • Queries data on HDFS or HBase
      • Allows table joins and aggregation
      [Diagram: ODBC → Impala → HDFS / HBase]
  • 12. Parquet
      • Row group: a group of rows in columnar format
          ‒ One (or more) per split while reading
          ‒ Max size buffered in memory while writing
          ‒ About 50MB < row group < 1GB
      • Column chunk: data for one column in a row group
          ‒ Column chunks can be read independently for efficient scans
      • Page: unit of access in a column chunk
          ‒ Should be big enough for efficient compression
          ‒ Min size to read while accessing a single record
          ‒ About 8KB < page < 1MB
      Source: Lars George, Cloudera. Data I/O, 2013
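The idea behind row groups and column chunks on the Parquet slide can be shown with a small in-memory pivot. This is only a sketch of the layout idea, not the Parquet file format or its API: `RowGroupSketch` and its methods are hypothetical, pages and compression are left out, and every row is assumed to carry the same columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of columnar layout; not the Parquet API.
public class RowGroupSketch {

    // Pivot a batch of rows into one "row group": column name -> column chunk.
    // Assumption: every row has the same set of columns.
    public static Map<String, List<Double>> toRowGroup(List<Map<String, Double>> rows) {
        Map<String, List<Double>> chunks = new LinkedHashMap<>();
        for (Map<String, Double> row : rows) {
            for (Map.Entry<String, Double> cell : row.entrySet()) {
                chunks.computeIfAbsent(cell.getKey(), k -> new ArrayList<>()).add(cell.getValue());
            }
        }
        return chunks;
    }

    // A scan over one column reads only that column chunk and skips all others.
    public static double sum(Map<String, List<Double>> rowGroup, String column) {
        List<Double> chunk = rowGroup.get(column);
        if (chunk == null) {
            return 0.0;
        }
        double total = 0.0;
        for (double value : chunk) {
            total += value;
        }
        return total;
    }
}
```

Summing one column of a row group built this way touches a single list, which is the reason the slide gives for reading column chunks independently.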
  • 13. HBase
      Column-oriented data storage. Very large tables: billions of rows × millions of columns.
      • Low latency
      • Random reads and writes (by primary key)
      • Distributed key/value store; automatic region sharding
      • Simple API
          ‒ PUT
          ‒ GET
          ‒ DELETE
          ‒ SCAN
  • 14. HBase building blocks
      • The most basic unit in HBase is a column
          ‒ Each column may have multiple versions, with each distinct value contained in a separate cell
          ‒ One or more columns form a row, which is addressed uniquely by a row key
          ‒ A row can have millions of columns
          ‒ Columns can be compressed or tagged to stay in memory
      • A table is a collection of rows
          ‒ Rows are always sorted lexicographically by their row key
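The data model on the two HBase slides (rows addressed by a row key, kept in lexicographic order, accessed through PUT, GET, DELETE and SCAN) can be mimicked with a sorted map. This is a toy, single-process sketch: `SortedRowStoreSketch` is a hypothetical class, not the HBase client API, and it ignores cell versions, column families and regions. It compares Java strings, whereas HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of a sorted row store; not the HBase client API.
public class SortedRowStoreSketch {

    // row key -> (column name -> value); TreeMap keeps row keys sorted.
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    // PUT: write one cell, creating the row if needed.
    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // GET: random read of one cell by row key; null if absent.
    public String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // DELETE: remove a whole row.
    public void delete(String rowKey) {
        rows.remove(rowKey);
    }

    // SCAN: row keys in [startKey, stopKey), returned in sorted order.
    public List<String> scan(String startKey, String stopKey) {
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).keySet());
    }

    public static void main(String[] args) {
        SortedRowStoreSketch store = new SortedRowStoreSketch();
        store.put("row2", "status", "open");
        store.put("row10", "status", "closed");
        store.put("row1", "status", "open");
        // Lexicographic order puts "row10" before "row2".
        System.out.println(store.scan("row1", "row3")); // prints [row1, row10, row2]
    }
}
```

The scan result shows why row-key design matters: numeric suffixes sort as text, so keys are often zero-padded when range scans should follow numeric order.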
  • 16. Hadoop Oozie
      • Oozie is a workflow scheduler system to manage Hadoop jobs
      • A workflow is a collection of actions
      • Actions are arranged in a directed acyclic graph
      [Diagram: a job submission goes to the Oozie server, which schedules MapReduce, Pig and other actions, receives each result, re-schedules an action on failure, and finally reports "All done".]
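Arranging actions in a directed acyclic graph means an action can start only after everything it depends on has finished. The sketch below orders actions with Kahn's topological sort; it illustrates the DAG idea only and is not how Oozie is configured or implemented. The class `WorkflowDagSketch` and the action names used with it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of DAG ordering; not the Oozie API.
public class WorkflowDagSketch {

    // edges: action -> actions that may start only after it has finished.
    public static List<String> executionOrder(Map<String, List<String>> edges) {
        // Count, for every action, how many predecessors are still unfinished.
        Map<String, Integer> pending = new TreeMap<>();
        for (Map.Entry<String, List<String>> edge : edges.entrySet()) {
            pending.putIfAbsent(edge.getKey(), 0);
            for (String next : edge.getValue()) {
                pending.merge(next, 1, Integer::sum);
            }
        }
        // Actions without predecessors can be scheduled immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.remove();
            order.add(action);
            // Finishing an action may unblock its successors.
            for (String next : edges.getOrDefault(action, Collections.<String>emptyList())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalArgumentException("workflow contains a cycle");
        }
        return order;
    }
}
```

For a chain such as import → pig-clean → mr-aggregate the method returns the actions in that order, and it rejects a graph with a cycle, which is why a workflow must be acyclic to be schedulable.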
  • 17. Hadoop v1 architecture
      • JobTracker
          ‒ Manages cluster resources
          ‒ Job scheduling
      • TaskTracker
          ‒ Per-node agent
          ‒ Task management
      • Single purpose: batch processing
  • 18. Hadoop v2 - YARN architecture
      • ResourceManager: allocates cluster resources
      • NodeManager: enforces node resource allocations
      • ApplicationMaster: application lifecycle and task scheduler
      • Multi-purpose system: batch, interactive querying, streaming, aggregation
      [Diagram: a client submits a job to the Resource Manager; Node Managers host containers and the application master (App Mstr); arrows are labelled MapReduce status, job submission, node status and resource request.]
  • 19. Hadoop infrastructure integration
      [Diagram]
      • Data in: TxB and IW extract programs, delivered over (S)FTP, SCP, HTTP(S) or JDBC
      • Storage in the Hadoop cluster: HDFS RAW, HDFS Binary, HDFS Results
      • Processing steps: parsing & validation, conversion & compression, data quality analysis, business analytics, data transformation, data delivery
      • Data out: Intrum Web, PAM, GSS, Catalyst, Dashboard
      • Monitoring and management
  • 20. Development environment integration
      Generic enterprise stack:
      • Maven
      • Spring, Spring-Hadoop
      • Hibernate
      • H2, MySQL cluster
      • LDAP, Kerberos
      • CI (Jenkins ...)
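  • 21. Java example: Hadoop task
      import javax.persistence.EntityManagerFactory;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;
      import org.apache.hadoop.util.Tool;
      import org.springframework.beans.factory.annotation.Autowired;
      import org.springframework.stereotype.Component;

      // LogAbstractEvent, tableName, getLastMySQLEvent() and readRowsToMySQL()
      // are project code that the slide does not show.
      @Component
      public class HBaseEventLogToMySQL extends Configured implements Tool {
          @Autowired
          private EntityManagerFactory entityManagerFactory;

          @Override
          public int run(String[] args) throws Exception {
              // Resume after the last event already copied to MySQL, if any.
              LogAbstractEvent lastEvent = getLastMySQLEvent();
              Scan scan;
              String lastEventKey = "";
              if (lastEvent == null) {
                  scan = new Scan(); // nothing copied yet: scan the whole table
              } else {
                  lastEventKey = lastEvent.getEventKey();
                  // The appended Character.MAX_VALUE makes the scan start after
                  // the last copied key rather than at it.
                  scan = new Scan(Bytes.toBytes(lastEventKey + Character.MAX_VALUE));
              }
              final Configuration conf = HBaseConfiguration.create(getConf());
              HTable table = new HTable(conf, tableName);
              ResultScanner resultScanner = table.getScanner(scan);
              readRowsToMySQL(resultScanner);
              return 0; // run() must return an exit code; the slide omits it
          }
      }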
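  • 22. Java example: Map part (Table)
      import java.io.IOException;
      import java.util.Map;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
      import org.apache.hadoop.hbase.mapreduce.TableMapper;
      import org.apache.hadoop.io.DoubleWritable;
      import org.apache.hadoop.io.Text;

      // MapWritableComparable, HbaseUtils and the constants are project code not shown on the slide.
      public class BasicProdStatusHbaseMapper extends TableMapper<Text, MapWritableComparable> {
          @Override
          public void map(ImmutableBytesWritable key, Result value, Context context)
                  throws IOException, InterruptedException {
              // Read the serialized case column of the current HBase row.
              byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
              // The type arguments are an assumption; the slide shows only Map<>.
              Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);
              MapWritableComparable map = new MapWritableComparable();
              map.put(new Text("originalCapital"),
                      new DoubleWritable((Double) caseMap.get("OriginalCapital")));
              map.put(new Text("remainingCapital"),
                      new DoubleWritable((Double) caseMap.get("RemainingCapital")));
              context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
              // mainStatusCode is computed in code that the slide omits.
              context.write(new Text(mainStatusCode), map);
          }
      }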
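  • 23. Java example: Reduce part
      import java.io.IOException;
      import org.apache.hadoop.io.DoubleWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // AggregationBean, BasicProdStatusWritable and the counter constants are project code not shown on the slide.
      public class BasicProdStatusHbaseReducer
              extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {
          @Override
          protected void reduce(Text key, Iterable<MapWritableComparable> values, Context context)
                  throws IOException, InterruptedException {
              String mainStatusCode = key.toString();
              AggregationBean ab = new AggregationBean();
              // Accumulate the capital figures of every case that shares this status code.
              for (MapWritableComparable map : values) {
                  double originalCapital = ((DoubleWritable) map.get(new Text("originalCapital"))).get();
                  double remainingCapital = ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
                  ab.add(originalCapital, remainingCapital);
              }
              // pid is defined in code that the slide omits.
              context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
              context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
          }
      }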
  • 24. Q&A. We are hiring!