2. Use case
Intrum Justitia SDC
• 20 countries / different applications to process, store and analyze data
• Non-unified data storage formats
• High number of data objects (Records, Transactions, Entities)
• Objects may have complex strong or loose relationship rules
• Often involve time-stamped events built from incomplete data
3. Possible solutions
• Custom solution built from the ground up
• Semi-clustered approach
‒ Tools from Oracle
‒ MySQL/PostgreSQL nodes
‒ Document-oriented tools like MongoDB
• “Big Data” approach (Map-Reduce)
4. Map-Reduce
• Simple programming model that applies to many large-scale computing problems
• Available in MongoDB for sharded data
• MapReduce tools usually offer:
‒ automatic parallelization
‒ load balancing
‒ network/disk transfer optimization
‒ handling of machine failures
‒ robustness
• Introduced by Google; open-source implementation by Apache (Hadoop), enterprise support by Cloudera
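The programming model above can be sketched in a few lines of plain Python: a map step emits (word, 1) pairs, a shuffle groups pairs by key, and a reduce step sums each group. This is only an illustration of the model; real frameworks (Hadoop, MongoDB's mapReduce) supply the parallelization, shuffling, and fault tolerance listed above.

```python
from collections import defaultdict

def map_phase(document):
    # Emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield word, 1

def reduce_phase(word, counts):
    # Sum all intermediate counts for one word.
    return word, sum(counts)

documents = ["big data big cluster", "big jobs"]

# Shuffle: group intermediate pairs by key (word).
groups = defaultdict(list)
for doc in documents:
    for word, one in map_phase(doc):
        groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)  # {'big': 3, 'data': 1, 'cluster': 1, 'jobs': 1}
```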
6. HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
• Mountable (FUSE)
7. HDFS file read
Code to data, not data to code
[Diagram: HDFS file read. (1) The client application asks the HDFS client to open /bob/file.txt. (2) The name node returns the file's blocks (A, B) and the data nodes holding each replica. (3)–(4) The client then reads each block directly from one of its data nodes; blocks A–D are replicated across DataNodes 1–3.]
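The read path can be modeled with a toy in-memory cluster, under the assumption (true of HDFS) that the name node serves only block locations while the client reads block contents directly from data nodes. All names here (the classes, the file /bob/file.txt, the block ids) are illustrative, not the Hadoop API.

```python
class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}          # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """Holds only metadata: which data nodes store each block of a file."""
    def __init__(self):
        self.block_map = {}       # path -> [(block_id, [DataNode, ...]), ...]

    def get_block_locations(self, path):
        return self.block_map[path]

def hdfs_read(name_node, path):
    # Ask the name node where the blocks are, then pull data from replicas.
    data = b""
    for block_id, replicas in name_node.get_block_locations(path):
        data += replicas[0].read_block(block_id)  # any healthy replica works
    return data

# Tiny cluster: 2 blocks, replication factor 2, 3 data nodes.
d1, d2, d3 = DataNode(1), DataNode(2), DataNode(3)
d1.blocks["A"] = b"hello "
d3.blocks["A"] = b"hello "
d2.blocks["B"] = b"world"
d3.blocks["B"] = b"world"

nn = NameNode()
nn.block_map["/bob/file.txt"] = [("A", [d1, d3]), ("B", [d2, d3])]

print(hdfs_read(nn, "/bob/file.txt").decode())  # hello world
```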
9. Hive vs. Pig
Hive:
• High-level data access language
• Data warehouse system for Hadoop
• Data aggregation
• Ad-hoc queries
• SQL-like language (HiveQL)
Pig:
• High-level data access language (zero Java knowledge required)
• Data preparation (ETL)
• Pig Latin scripting language
[Diagram: SQL → Hive → MapReduce; Pig Latin → Pig → MapReduce]
10. HiveQL vs. Pig Latin
insert into ValuableClicksPerDMA
select dma, count(*) from geoinfo
join (
  select name, ipaddr from users
  join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Pig Latin is procedural, whereas HiveQL is declarative.
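What both queries compute can be traced step by step in plain Python, mirroring the Pig Latin statements above with in-memory lists in place of HDFS files. The field layout follows the script; the sample rows are made up for illustration.

```python
from collections import defaultdict

users = [("alice", 30, "1.1.1.1"), ("bob", 25, "2.2.2.2")]          # (name, age, ipaddr)
clicks = [("alice", "/a", 5), ("alice", "/b", 0), ("bob", "/c", 2)]  # (user, url, value)
geoinfo = [("1.1.1.1", "NYC"), ("2.2.2.2", "SFO")]                   # (ipaddr, dma)

# filter Clicks by value > 0
valuable_clicks = [c for c in clicks if c[2] > 0]

# join Users by name, ValuableClicks by user
user_clicks = [(u, c) for u in users for c in valuable_clicks if u[0] == c[0]]

# join UserClicks by ipaddr, Geoinfo by ipaddr
user_geo = [(u, c, g) for (u, c) in user_clicks for g in geoinfo if u[2] == g[0]]

# group UserGeo by dma; foreach ... generate group, COUNT(UserGeo)
by_dma = defaultdict(int)
for (_, _, g) in user_geo:
    by_dma[g[1]] += 1

print(dict(by_dma))  # {'NYC': 1, 'SFO': 1}
```

Each intermediate variable corresponds to one Pig Latin relation, which is exactly the procedural style the note above contrasts with HiveQL's single declarative statement.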
11. Impala
• Real-time queries, up to ~100x faster than Hive
• Direct data access
• Query data on HDFS or HBase
• Allows table joins and aggregation
[Diagram: clients connect via ODBC to Impala, which queries data directly on HDFS and HBase]
12. Parquet
• Row Groups: A group of rows in columnar format
‒ One (or more) per split while reading
‒ Max size buffered in memory while writing
‒ About 50MB < row group < 1GB
• Column chunk: Data for one column in a row group
‒ Column chunks can be read independently for efficient scans
• Page: Unit of access in a column chunk
‒ Should be big enough for efficient compression
‒ Min size to read while accessing a single record
‒ About 8KB < page < 1MB
Lars George, Cloudera. Data I/O, 2013
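The row-group/column-chunk idea can be sketched in pure Python: rows are buffered, then pivoted into per-column chunks, so a scan that needs one column touches only that column's data. This is a sketch of the layout concept only, not Parquet's actual on-disk format; names and sizes are illustrative.

```python
rows = [
    {"id": 1, "name": "a", "value": 10},
    {"id": 2, "name": "b", "value": 20},
    {"id": 3, "name": "c", "value": 30},
]

def to_row_group(rows):
    """Pivot a buffered batch of rows into per-column chunks."""
    columns = {}
    for row in rows:
        for col, val in row.items():
            columns.setdefault(col, []).append(val)
    return columns

row_group = to_row_group(rows)

# An efficient single-column scan reads only the "value" chunk,
# never touching "id" or "name":
print(sum(row_group["value"]))  # 60
```

Storing each chunk contiguously is also what makes the per-page compression mentioned above effective: values of one column tend to be similar.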
13. HBase
Column-oriented data storage. Very large tables – billions of rows × millions of columns.
• Low Latency
• Random Reads And Writes (by PK)
• Distributed Key/Value Store; automatic region sharding
• Simple API
‒ PUT
‒ GET
‒ DELETE
‒ SCAN
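A toy in-memory model shows how that small API maps onto HBase's data model: row keys map to versioned cells, rows stay sorted lexicographically, and SCAN walks a key range. This illustrates the model only; it is not HBase's client API, and all names are made up.

```python
import bisect

class ToyHBaseTable:
    def __init__(self):
        self.row_keys = []   # row keys kept in sorted (lexicographic) order
        self.rows = {}       # row_key -> {column: {timestamp: value}}

    def put(self, row_key, column, value, ts):
        if row_key not in self.rows:
            bisect.insort(self.row_keys, row_key)
            self.rows[row_key] = {}
        self.rows[row_key].setdefault(column, {})[ts] = value

    def get(self, row_key, column):
        """Return the latest version of a cell."""
        versions = self.rows[row_key][column]
        return versions[max(versions)]

    def delete(self, row_key):
        self.row_keys.remove(row_key)
        del self.rows[row_key]

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop."""
        lo = bisect.bisect_left(self.row_keys, start)
        hi = bisect.bisect_left(self.row_keys, stop)
        for key in self.row_keys[lo:hi]:
            yield key, self.rows[key]

t = ToyHBaseTable()
t.put("row1", "cf:name", "old", ts=1)
t.put("row1", "cf:name", "new", ts=2)   # second version of the same cell
t.put("row2", "cf:name", "x", ts=1)
print(t.get("row1", "cf:name"))                # new
print([k for k, _ in t.scan("row1", "row3")])  # ['row1', 'row2']
```

Because rows are sorted by key, both random GETs (binary search) and range SCANs are cheap, which is the point of the "by PK" bullet above.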
14. HBase building blocks
• The most basic unit in HBase is a column
‒ Each column may have multiple versions, with each distinct value stored in a separate cell
• A row is one or more columns, addressed uniquely by a row key
‒ A row can have millions of columns
‒ Columns can be compressed or tagged to stay in memory
• A table is a collection of rows
‒ All rows are always sorted lexicographically by their row key
16. Hadoop Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs
• A workflow is a collection of actions
• Actions are arranged in a directed acyclic graph (DAG)
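The scheduling idea can be sketched with the standard library's topological sorter, under simple assumptions: each action runs only after all of its predecessors succeed, and a failed action is re-scheduled. The action names, the DAG, and the retry policy are all illustrative; this is not Oozie's workflow XML or API.

```python
from graphlib import TopologicalSorter

# action -> set of actions it depends on (a DAG)
workflow = {
    "extract": set(),
    "mapreduce": {"extract"},
    "pig": {"extract"},
    "report": {"mapreduce", "pig"},
}

def run_action(name, attempt):
    # Simulate one flaky action that fails on its first attempt.
    return not (name == "pig" and attempt == 0)

log = []
for action in TopologicalSorter(workflow).static_order():
    attempt = 0
    while not run_action(action, attempt):
        log.append(f"{action}: failure, re-scheduling")
        attempt += 1
    log.append(f"{action}: result ok")
log.append("All done")
```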
[Diagram: a client submits a job to the Oozie server; Oozie schedules MapReduce, Pig, and other actions, collects their results, re-schedules actions that fail, and finally reports "All done".]
19. Hadoop infrastructure integration
[Diagram: Hadoop infrastructure integration. IW extract programs pull data from TxB source systems over (S)FTP, SCP, HTTP(S), and JDBC into HDFS (RAW). Inside the Hadoop cluster the data passes through Parsing & Validation, Conversion & Compression (HDFS Binary), Data Quality Analysis, Business Analytics, Data Transformation, and Data Delivery (HDFS Results). Results are delivered out to Intrum Web, PAM, GSS, Catalyst, and Dashboard; Monitoring and Management spans the whole cluster.]