Big Data processing using Hadoop
infrastructure
Use case
Intrum Justitia SDC
• 20 countries / different applications to process, store and analyze data
• Non-unified data storage formats
• High number of data objects (Records, Transactions, Entities)
• May have complex relation rules, both strong and loose
• Often involve time-stamped events and are built from incomplete data
Possible solutions
• Custom solution built from the ground up
• Semi-clustered approach
‒ Tools from Oracle
‒ MySQL/PostgreSQL nodes
‒ Document-oriented tools like MongoDB
• “Big Data” approach (Map-Reduce)
Map-Reduce
• Simple programming model that applies
to many large-scale computing problems
• Available in MongoDB for sharded data
• MapReduce tools usually offer:
‒ automatic parallelization
‒ load balancing
‒ network/disk transfer optimization
‒ handling of machine failures
‒ robustness
• Introduced by Google, open-source
implementation by Apache (Hadoop),
enterprise support by Cloudera
CDH ecosystem
[Diagram: the CDH stack, managed by Cloudera Manager with YARN as resource manager — HDFS, MapReduce, HBase (NoSQL), Hive (HQL), Sqoop (import/export), Parquet, Impala (SQL/ODBC), Pig (Pig Latin), ZooKeeper (coordination) and Hue.]
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
• Mountable (FUSE)
HDFS file read
Code to data, not data to code.
[Diagram: a client application uses the HDFS client to ask the NameNode for the block locations of /bob/file.txt (blocks A and B, each replicated on several DataNodes), then reads each block directly from one of the DataNodes holding a replica.]
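A minimal sketch of the same read through the Java API mentioned on the HDFS slide; the path is the illustrative /bob/file.txt from the diagram, and the configuration is picked up from the client's core-site.xml:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS etc. from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The HDFS client asks the NameNode for block locations,
            // then streams the blocks directly from the DataNodes
            Path path = new Path("/bob/file.txt"); // illustrative path from the diagram
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }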
Map-Reduce workflow and redundancy
[Diagram: (1) the user program forks a master and worker processes; (2) the master assigns map tasks and reduce tasks; (3) map workers read the input splits 0-4; (4) map output is written to intermediate files on local disk; reduce workers then fetch the intermediate data and (6) write output files 0 and 1. The master handles machine failures by re-assigning the tasks of failed workers.]
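To make the map and reduce phases concrete, a minimal word-count sketch against the Hadoop Java API; the class names are illustrative and the input/output paths come from the command line:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // MAP phase: each input split is fed line by line to the mapper
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) {
                        context.write(new Text(word), ONE);
                    }
                }
            }
        }

        // REDUCE phase: all counts for the same word arrive at one reducer
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }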
Hive vs. Pig

Hive
• High-level data access language
• Data Warehouse System for Hadoop
• Data Aggregation
• Ad-Hoc Queries
• SQL-like Language (HiveQL)

Pig
• High-level data access language (zero Java knowledge required)
• Data preparation (ETL)
• Pig Latin scripting language

[Diagram: HiveQL (SQL) is compiled by Hive into MapReduce jobs; Pig Latin is compiled by Pig into MapReduce jobs.]
HiveQL vs. Pig Latin
insert into ValClickPerDMA
select dma, count(*) from geoinfo
join (
  select name, ipaddr from users
  join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Pig Latin is procedural, whereas HiveQL is declarative.
Impala
• Real-time queries, ~100x faster than Hive
• Direct data access
• Query data on HDFS or HBase
• Allows table joins and aggregation
[Diagram: clients connect to Impala over SQL/ODBC; Impala queries the data directly on HDFS and HBase.]
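Beyond the ODBC path shown above, Impala can also be queried from Java over JDBC. The sketch below assumes the Hive JDBC (HiveServer2) driver pointed at an impalad on its default port 21050, an unsecured cluster, and an illustrative host and table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; ";auth=noSasl" for clusters without Kerberos/LDAP
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://impalad-host.example.com:21050/;auth=noSasl";

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("select count(*) from clicks")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }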
Parquet
• Row Groups: A group of rows in columnar format
‒ One (or more) per split while reading
‒ Max size buffered in memory while writing
‒ About 50MB < row group < 1GB
• Column chunk: Data for one column in a row group
‒ Column chunks can be read independently for efficient scans
• Page: Unit of access in a column chunk
‒ Should be big enough for efficient compression
‒ Min size to read while accessing a single record
‒ About 8KB < page < 1MB
Lars George, Cloudera. Data I/O, 2013
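A hedged sketch of producing a Parquet file from Java with parquet-avro, mainly to show where the row-group and page sizes above are configured; the schema, path and the exact builder methods are assumptions and differ between parquet-mr versions:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteExample {
        public static void main(String[] args) throws Exception {
            // Illustrative Avro schema for the records being written
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
                    + "{\"name\":\"user\",\"type\":\"string\"},"
                    + "{\"name\":\"value\",\"type\":\"double\"}]}");

            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("/data/clicks.parquet"))
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .withRowGroupSize(128 * 1024 * 1024) // row group: buffered in memory while writing
                    .withPageSize(1024 * 1024)           // page: unit of access and compression
                    .build()) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("user", "alice");
                record.put("value", 1.0);
                writer.write(record);
            }
        }
    }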
HBase
Column-oriented data storage. Very large tables – billions
of rows × millions of columns.
• Low Latency
• Random Reads And Writes (by PK)
• Distributed Key/Value Store; automatic region sharding
• Simple API
‒ PUT
‒ GET
‒ DELETE
‒ SCAN
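A minimal sketch of the PUT/GET/DELETE/SCAN calls above, using the same pre-1.0 HTable client API as the Java examples later in the deck; the table name, column family and row keys are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrudExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "event_log"); // illustrative table name
            try {
                // PUT: write one cell, addressed by row key / family / qualifier
                Put put = new Put(Bytes.toBytes("row-1"));
                put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("42"));
                table.put(put);

                // GET: random read by primary (row) key
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"))));

                // SCAN: iterate over a row-key range
                Scan scan = new Scan(Bytes.toBytes("row-0"), Bytes.toBytes("row-9"));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                } finally {
                    scanner.close();
                }

                // DELETE: remove the row again
                table.delete(new Delete(Bytes.toBytes("row-1")));
            } finally {
                table.close();
            }
        }
    }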
HBase building blocks
• The most basic unit in HBase is a column
‒ Each column may have multiple versions, with each distinct value contained in a separate cell
‒ One or more columns form a row, which is addressed uniquely by a row key
‒ A row can have millions of columns
‒ Column families can be compressed or tagged to stay in memory
• A table is a collection of rows
‒ All rows are always sorted lexicographically by their row key
HBase read
[Diagram: HBase read path — the client uses ZooKeeper to locate the regions it needs and then reads directly from the responsible RegionServers, whose data is stored on Hadoop (HDFS); the HMaster is not involved in serving reads.]
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs.
• A workflow is a collection of actions
• Actions are arranged in a Directed Acyclic Graph (DAG)
[Diagram: job submission — the client submits a workflow to the Oozie server; the server schedules the individual MapReduce, Pig and other actions on Hadoop, collects their results, re-schedules an action on failure, and reports "All done" when the workflow finishes.]
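Not shown on the slide, but roughly how a workflow is submitted programmatically through the Oozie Java client; the server URL, the HDFS application path and the job properties are assumptions:

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            // URL of the Oozie server (illustrative host and default port)
            OozieClient oozie = new OozieClient("http://oozie-host.example.com:11000/oozie");

            // Job properties: where workflow.xml lives plus cluster endpoints used by the workflow
            Properties props = oozie.createConfiguration();
            props.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/workflows/daily-load");
            props.setProperty("nameNode", "hdfs://namenode:8020");
            props.setProperty("jobTracker", "resourcemanager:8032");

            // Submit and start the workflow, then poll its status until it finishes
            String jobId = oozie.run(props);
            WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
            while (status == WorkflowJob.Status.PREP || status == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10_000);
                status = oozie.getJobInfo(jobId).getStatus();
            }
            System.out.println("Workflow " + jobId + " finished with status " + status);
        }
    }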
Hadoop v1 architecture
• JobTracker
‒ Manage Cluster Resources
‒ Job Scheduling
• TaskTracker
‒ Per-node agent
‒ Task management
• Single purpose – batch processing
Hadoop v2 - YARN architecture
• ResourceManager – Allocates cluster resources
• NodeManager – Enforces node resource allocations
• ApplicationMaster – Application lifecycle and task scheduler
• Multi-purpose system: batch, interactive querying, streaming, aggregation
[Diagram: a client submits a job to the ResourceManager; NodeManagers report node status and host containers; a per-application ApplicationMaster (App Mstr) sends resource requests to the ResourceManager and reports MapReduce status.]
Hadoop infrastructure integration
[Diagram: IW extract programs pull data in from the TxB source systems over (S)FTP, SCP, HTTP(S) and JDBC into a raw HDFS landing area. Inside the Hadoop cluster the pipeline runs parsing & validation, conversion & compression (raw to binary HDFS formats), data quality analysis, business analytics, data transformation and data delivery, all under common monitoring and management. Result data sets in HDFS are delivered out to Intrum Web, PAM, GSS, Catalyst and the Dashboard.]
Development environment integration
Generic enterprise stack
• Maven
• Spring, Spring-Hadoop
• Hibernate
• H2, MySQL cluster
• LDAP, Kerberos
• CI (Jenkins ...)
Java example: Hadoop task
@Component
public class HBaseEventLogToMySQL extends Configured implements Tool {
    @Autowired private EntityManagerFactory entityManagerFactory;

    @Override public int run(String[] args) throws Exception {
        // Resume from the last event already copied to MySQL, if any
        LogAbstractEvent lastEvent = getLastMySQLEvent();
        Scan scan;
        String lastEventKey = "";
        if (lastEvent == null) {
            scan = new Scan();
        } else {
            lastEventKey = lastEvent.getEventKey();
            // Start scanning just after the last processed row key
            scan = new Scan(Bytes.toBytes(lastEventKey + Character.MAX_VALUE));
        }
        final Configuration conf = HBaseConfiguration.create(getConf());
        HTable table = new HTable(conf, tableName);
        try {
            ResultScanner resultScanner = table.getScanner(scan);
            readRowsToMySQL(resultScanner);
        } finally {
            table.close();
        }
        return 0;
    }
}
Java example: Map part (Table)
public class BasicProdStatusHbaseMapper
        extends TableMapper<Text, MapWritableComparable> {

    @Override public void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // Deserialize the case column into a map (assumed Map<String, Object>) of field -> value
        byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
        Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);

        MapWritableComparable map = new MapWritableComparable();
        map.put(new Text("originalCapital"),
                new DoubleWritable((Double) caseMap.get("OriginalCapital")));
        map.put(new Text("remainingCapital"),
                new DoubleWritable((Double) caseMap.get("RemainingCapital")));

        // "MainStatusCode" is an assumed field name; the slide does not show
        // where mainStatusCode comes from
        String mainStatusCode = (String) caseMap.get("MainStatusCode");

        context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
        context.write(new Text(mainStatusCode), map);
    }
}
Java example: Reduce part
public class BasicProdStatusHbaseReducer
        extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {

    @Override protected void reduce(Text key, Iterable<MapWritableComparable> values,
            Context context) throws IOException, InterruptedException {
        String mainStatusCode = key.toString();

        // Aggregate the capital figures of all cases sharing this status code
        AggregationBean ab = new AggregationBean();
        for (MapWritableComparable map : values) {
            double originalCapital = ((DoubleWritable) map.get(new Text("originalCapital"))).get();
            double remainingCapital = ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
            ab.add(originalCapital, remainingCapital);
        }

        // pid is a field of the reducer (not shown on the slide)
        context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
        context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
    }
}
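As a sketch of how the mapper and reducer above could be wired together (not shown in the deck): HBase's TableMapReduceUtil feeds the mapper from a table scan. The table name, job name and output handling are assumptions, and the deck's own classes (BasicProdStatusHbaseMapper, BasicProdStatusHbaseReducer, MapWritableComparable, BasicProdStatusWritable) are assumed to be on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BasicProdStatusJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "basic-prod-status"); // illustrative job name
            job.setJarByClass(BasicProdStatusJob.class);

            // Feed the mapper from an HBase table scan
            Scan scan = new Scan();
            scan.setCaching(500);       // larger scanner caching for MapReduce scans
            scan.setCacheBlocks(false); // don't pollute the block cache with a full scan
            TableMapReduceUtil.initTableMapperJob(
                    "cases",                          // assumed source table name
                    scan,
                    BasicProdStatusHbaseMapper.class, // mapper from the previous slide
                    Text.class,                       // mapper output key
                    MapWritableComparable.class,      // mapper output value
                    job);

            job.setReducerClass(BasicProdStatusHbaseReducer.class);
            job.setOutputKeyClass(BasicProdStatusWritable.class);
            job.setOutputValueClass(NullWritable.class);

            // Stand-in output: write the aggregated beans as text; the deck's real job
            // presumably uses a DB-oriented output format instead
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }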
Q&A
We are hiring!
Big Data processing using Hadoop
infrastructure
