Big Data processing using Hadoop
infrastructure
Use case
Intrum Justitia SDC
• 20 countries / different applications to process, store and analyze data
• Non-unified data storage formats
• High number of data objects (Records, Transactions, Entities)
• May have complex relation rules, both strong and loose
• Often involve time-stamped events and are built from incomplete data
Possible solutions
• Custom solution built from the ground up
• Semi-clustered approach
‒ Tools from Oracle
‒ MySQL/PostgreSQL nodes
‒ Document-oriented tools like MongoDB
• “Big Data” approach (Map-Reduce)
Map-Reduce
• Simple programming model that applies
to many large-scale computing problems
• Available in MongoDB for sharded data
• MapReduce tools usually offer:
‒ automatic parallelization
‒ load balancing
‒ network/disk transfer optimization
‒ handling of machine failures
‒ robustness
• Introduced by Google, open-source
implementation by Apache (Hadoop),
enterprise support by Cloudera
CDH ecosystem
[Diagram: the CDH stack, managed by Cloudera Manager with YARN as resource manager — HDFS, MapReduce, HBase (NoSQL), Hive (HQL), Sqoop (import/export), Parquet, Impala (SQL/ODBC), Pig (Pig Latin), ZooKeeper (coordination) and Hue.]
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
• Mountable (FUSE)
HDFS file read
Code to data, not data to code.
[Diagram: a client application uses the HDFS client to ask the NameNode for the block locations of /bob/file.txt (blocks A and B, each replicated on several DataNodes), then reads each block directly from one of the DataNodes holding a replica.]
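A minimal sketch of the same read through the Java API mentioned on the HDFS slide; the path is the illustrative /bob/file.txt from the diagram, and the configuration is picked up from the client's core-site.xml:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS etc. from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The HDFS client asks the NameNode for block locations,
            // then streams the blocks directly from the DataNodes
            Path path = new Path("/bob/file.txt"); // illustrative path from the diagram
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }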
Map-Reduce workflow and redundancy
[Diagram: (1) the user program forks a master and worker processes; (2) the master assigns map tasks and reduce tasks; (3) map workers read the input splits 0-4; (4) map output is written to intermediate files on local disk; reduce workers then fetch the intermediate data and (6) write output files 0 and 1. The master handles machine failures by re-assigning the tasks of failed workers.]
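To make the map and reduce phases concrete, a minimal word-count sketch against the Hadoop Java API; the class names are illustrative and the input/output paths come from the command line:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // MAP phase: each input split is fed line by line to the mapper
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) {
                        context.write(new Text(word), ONE);
                    }
                }
            }
        }

        // REDUCE phase: all counts for the same word arrive at one reducer
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }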
Hive vs. Pig

Hive
• High-level data access language
• Data Warehouse System for Hadoop
• Data Aggregation
• Ad-Hoc Queries
• SQL-like Language (HiveQL)

Pig
• High-level data access language (zero Java knowledge required)
• Data preparation (ETL)
• Pig Latin scripting language

[Diagram: HiveQL (SQL) is compiled by Hive into MapReduce jobs; Pig Latin is compiled by Pig into MapReduce jobs.]
HiveQL vs. Pig Latin
insert into ValClickPerDMA
select dma, count(*) from geoinfo
join (
  select name, ipaddr from users
  join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Pig Latin is procedural, whereas HiveQL is declarative.
Impala
• Real-time queries, ~100x faster than Hive
• Direct data access
• Query data on HDFS or HBase
• Allows table joins and aggregation
[Diagram: clients connect to Impala over SQL/ODBC; Impala queries the data directly on HDFS and HBase.]
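Beyond the ODBC path shown above, Impala can also be queried from Java over JDBC. The sketch below assumes the Hive JDBC (HiveServer2) driver pointed at an impalad on its default port 21050, an unsecured cluster, and an illustrative host and table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; ";auth=noSasl" for clusters without Kerberos/LDAP
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://impalad-host.example.com:21050/;auth=noSasl";

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("select count(*) from clicks")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }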
Parquet
• Row Groups: A group of rows in columnar format
‒ One (or more) per split while reading
‒ Max size buffered in memory while writing
‒ About 50MB < row group < 1GB
• Column chunk: Data for one column in a row group
‒ Column chunks can be read independently for efficient scans
• Page: Unit of access in a column chunk
‒ Should be big enough for efficient compression
‒ Min size to read while accessing a single record
‒ About 8KB < page < 1MB
Lars George, Cloudera. Data I/O, 2013
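A hedged sketch of producing a Parquet file from Java with parquet-avro, mainly to show where the row-group and page sizes above are configured; the schema, path and the exact builder methods are assumptions and differ between parquet-mr versions:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteExample {
        public static void main(String[] args) throws Exception {
            // Illustrative Avro schema for the records being written
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
                    + "{\"name\":\"user\",\"type\":\"string\"},"
                    + "{\"name\":\"value\",\"type\":\"double\"}]}");

            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("/data/clicks.parquet"))
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .withRowGroupSize(128 * 1024 * 1024) // row group: buffered in memory while writing
                    .withPageSize(1024 * 1024)           // page: unit of access and compression
                    .build()) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("user", "alice");
                record.put("value", 1.0);
                writer.write(record);
            }
        }
    }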
HBase
Column-oriented data storage. Very large tables – billions
of rows × millions of columns.
• Low Latency
• Random Reads And Writes (by PK)
• Distributed Key/Value Store; automatic region sharding
• Simple API
‒ PUT
‒ GET
‒ DELETE
‒ SCAN
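A minimal sketch of the PUT/GET/DELETE/SCAN calls above, using the same pre-1.0 HTable client API as the Java examples later in the deck; the table name, column family and row keys are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrudExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "event_log"); // illustrative table name
            try {
                // PUT: write one cell, addressed by row key / family / qualifier
                Put put = new Put(Bytes.toBytes("row-1"));
                put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("42"));
                table.put(put);

                // GET: random read by primary (row) key
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"))));

                // SCAN: iterate over a row-key range
                Scan scan = new Scan(Bytes.toBytes("row-0"), Bytes.toBytes("row-9"));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                } finally {
                    scanner.close();
                }

                // DELETE: remove the row again
                table.delete(new Delete(Bytes.toBytes("row-1")));
            } finally {
                table.close();
            }
        }
    }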
HBase building blocks
• The most basic unit in HBase is a column
‒ Each column may have multiple versions, with each distinct value contained in a separate cell
‒ One or more columns form a row, which is addressed uniquely by a row key
‒ A row can have millions of columns
‒ Column families can be compressed or tagged to stay in memory
• A table is a collection of rows
‒ All rows are always sorted lexicographically by their row key
HBase read
[Diagram: HBase read path — the client uses ZooKeeper to locate the regions it needs and then reads directly from the responsible RegionServers, whose data is stored on Hadoop (HDFS); the HMaster is not involved in serving reads.]
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs.
• A workflow is a collection of actions
• Actions are arranged in a Directed Acyclic Graph (DAG)
[Diagram: job submission — the client submits a workflow to the Oozie server; the server schedules the individual MapReduce, Pig and other actions on Hadoop, collects their results, re-schedules an action on failure, and reports "All done" when the workflow finishes.]
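Not shown on the slide, but roughly how a workflow is submitted programmatically through the Oozie Java client; the server URL, the HDFS application path and the job properties are assumptions:

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            // URL of the Oozie server (illustrative host and default port)
            OozieClient oozie = new OozieClient("http://oozie-host.example.com:11000/oozie");

            // Job properties: where workflow.xml lives plus cluster endpoints used by the workflow
            Properties props = oozie.createConfiguration();
            props.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/workflows/daily-load");
            props.setProperty("nameNode", "hdfs://namenode:8020");
            props.setProperty("jobTracker", "resourcemanager:8032");

            // Submit and start the workflow, then poll its status until it finishes
            String jobId = oozie.run(props);
            WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
            while (status == WorkflowJob.Status.PREP || status == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10_000);
                status = oozie.getJobInfo(jobId).getStatus();
            }
            System.out.println("Workflow " + jobId + " finished with status " + status);
        }
    }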
Hadoop v1 architecture
• JobTracker
‒ Manage Cluster Resources
‒ Job Scheduling
• TaskTracker
‒ Per-node agent
‒ Task management
• Single purpose – batch processing
Hadoop v2 - YARN architecture
• ResourceManager – Allocates cluster resources
• NodeManager – Enforces node resource allocations
• ApplicationMaster – Application lifecycle and task scheduler
• Multi-purpose system: batch, interactive querying, streaming, aggregation
[Diagram: a client submits a job to the ResourceManager; NodeManagers report node status and host containers; a per-application ApplicationMaster (App Mstr) sends resource requests to the ResourceManager and reports MapReduce status.]
Hadoop infrastructure integration
[Diagram: IW extract programs pull data in from the TxB source systems over (S)FTP, SCP, HTTP(S) and JDBC into a raw HDFS landing area. Inside the Hadoop cluster the pipeline runs parsing & validation, conversion & compression (raw to binary HDFS formats), data quality analysis, business analytics, data transformation and data delivery, all under common monitoring and management. Result data sets in HDFS are delivered out to Intrum Web, PAM, GSS, Catalyst and the Dashboard.]
Development environment integration
Generic enterprise stack
• Maven
• Spring, Spring-Hadoop
• Hibernate
• H2, MySQL cluster
• LDAP, Kerberos
• CI (Jenkins ...)
Java example: Hadoop task
@Component
public class HBaseEventLogToMySQL extends Configured implements Tool {
    @Autowired private EntityManagerFactory entityManagerFactory;

    @Override public int run(String[] args) throws Exception {
        // Resume from the last event already copied to MySQL, if any
        LogAbstractEvent lastEvent = getLastMySQLEvent();
        Scan scan;
        String lastEventKey = "";
        if (lastEvent == null) {
            scan = new Scan();
        } else {
            lastEventKey = lastEvent.getEventKey();
            // Start scanning just after the last processed row key
            scan = new Scan(Bytes.toBytes(lastEventKey + Character.MAX_VALUE));
        }
        final Configuration conf = HBaseConfiguration.create(getConf());
        HTable table = new HTable(conf, tableName);
        try {
            ResultScanner resultScanner = table.getScanner(scan);
            readRowsToMySQL(resultScanner);
        } finally {
            table.close();
        }
        return 0;
    }
}
Java example: Map part (Table)
public class BasicProdStatusHbaseMapper
        extends TableMapper<Text, MapWritableComparable> {

    @Override public void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // Deserialize the case column into a map (assumed Map<String, Object>) of field -> value
        byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
        Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);

        MapWritableComparable map = new MapWritableComparable();
        map.put(new Text("originalCapital"),
                new DoubleWritable((Double) caseMap.get("OriginalCapital")));
        map.put(new Text("remainingCapital"),
                new DoubleWritable((Double) caseMap.get("RemainingCapital")));

        // "MainStatusCode" is an assumed field name; the slide does not show
        // where mainStatusCode comes from
        String mainStatusCode = (String) caseMap.get("MainStatusCode");

        context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
        context.write(new Text(mainStatusCode), map);
    }
}
Java example: Reduce part
public class BasicProdStatusHbaseReducer
        extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {

    @Override protected void reduce(Text key, Iterable<MapWritableComparable> values,
            Context context) throws IOException, InterruptedException {
        String mainStatusCode = key.toString();

        // Aggregate the capital figures of all cases sharing this status code
        AggregationBean ab = new AggregationBean();
        for (MapWritableComparable map : values) {
            double originalCapital = ((DoubleWritable) map.get(new Text("originalCapital"))).get();
            double remainingCapital = ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
            ab.add(originalCapital, remainingCapital);
        }

        // pid is a field of the reducer (not shown on the slide)
        context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
        context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
    }
}
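As a sketch of how the mapper and reducer above could be wired together (not shown in the deck): HBase's TableMapReduceUtil feeds the mapper from a table scan. The table name, job name and output handling are assumptions, and the deck's own classes (BasicProdStatusHbaseMapper, BasicProdStatusHbaseReducer, MapWritableComparable, BasicProdStatusWritable) are assumed to be on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BasicProdStatusJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "basic-prod-status"); // illustrative job name
            job.setJarByClass(BasicProdStatusJob.class);

            // Feed the mapper from an HBase table scan
            Scan scan = new Scan();
            scan.setCaching(500);       // larger scanner caching for MapReduce scans
            scan.setCacheBlocks(false); // don't pollute the block cache with a full scan
            TableMapReduceUtil.initTableMapperJob(
                    "cases",                          // assumed source table name
                    scan,
                    BasicProdStatusHbaseMapper.class, // mapper from the previous slide
                    Text.class,                       // mapper output key
                    MapWritableComparable.class,      // mapper output value
                    job);

            job.setReducerClass(BasicProdStatusHbaseReducer.class);
            job.setOutputKeyClass(BasicProdStatusWritable.class);
            job.setOutputValueClass(NullWritable.class);

            // Stand-in output: write the aggregated beans as text; the deck's real job
            // presumably uses a DB-oriented output format instead
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }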
Q&A
We are hiring!
Big Data processing using Hadoop
infrastructure
