DWH & BigData – 
architecture approaches 
Odessa 
Vladimir Slobodianiuk 
Date: 2014 
www.luxoft.com
Agenda 
1 Big Data – what is it 
2 Hadoop vs RDBMS – pros and cons 
3 Hadoop & Enterprise architecture 
4 Hadoop as ETL engine
Big Data 
– what is it 
Current state 
• Big data is an all-encompassing term for any collection of data sets so large and 
complex that it becomes difficult to process using traditional data processing 
applications. 
Limitations & Problems 
• Big data is difficult to work with using most relational databases, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers 
• eBay.com uses two data warehouses at 7.5 petabytes 
• Walmart handles more than 1 million customer transactions every hour 
• Facebook handles 50 billion photos from its user base 
• In 2012, the Obama administration announced the Big Data Research and Development Initiative
Hadoop vs RDBMS 
CORE HADOOP - MapReduce 
In 2004, Google published a paper on a process called MapReduce 
• DISTRIBUTED COMPUTING FRAMEWORK 
• Process large jobs in parallel across many nodes and combine the results
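The idea of processing a job in parallel and then combining the results can be sketched with a word count. This is a minimal single-machine illustration in Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a thread pool stands in for the cluster nodes.

```python
from collections import defaultdict
from itertools import chain
from multiprocessing.dummy import Pool  # thread pool standing in for cluster nodes


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    """Run the map step on each chunk in parallel, then combine the results."""
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, chunks)
    grouped = shuffle(chain.from_iterable(mapped))
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Big data big", "data Hadoop"]))
```

In a real cluster each chunk would be a block of a distributed file and the map and reduce steps would run on different machines; the structure of the computation is the same.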
Hadoop Structure 
• HDFS is a distributed file system designed to run on commodity hardware 
• HBase stores data rows in labelled tables (each row has a sortable key and an arbitrary number of columns) 
• Hive provides data summarization, query, and analysis through a SQL-like interface 
• Pig is a platform for analyzing large data sets, built around a high-level language (Pig Latin)
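The HBase row model mentioned above (a sortable key plus an arbitrary number of columns) can be pictured with a toy in-memory table. This is only a sketch of the data model: `ToyTable` and its methods are hypothetical names, not the HBase client API, and column families, versions, and persistence are left out.

```python
import bisect


class ToyTable:
    """Toy table: rows ordered by key, each row holding any set of columns."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Write one cell; new rows are inserted at their sorted position."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of all columns of one row (empty dict if absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (key, columns) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


if __name__ == "__main__":
    table = ToyTable()
    table.put("user#002", "name", "Bob")
    table.put("user#001", "name", "Ann")
    table.put("user#001", "email", "ann@example.com")  # rows may differ in columns
    for key, columns in table.scan("user#", "user#~"):
        print(key, columns)
```

Because keys are kept sorted, a range scan over a key prefix touches only neighbouring rows, which is why key design matters so much in this kind of store.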
Hadoop vs RDBMS 
RDBMS 
• Performance for relational data 
• Machine query optimization 
• Mature workload management 
• High-concurrency interactive query processing 
Hadoop 
• Schema-less model 
• Human query optimization 
• Ability to create complex dataflows with multiple inputs and outputs 
• Parallelization of many analytic functions 
How might this change in the future 
• Query optimization improvements in Hive 
  – Statistics, better join ordering, more join types, etc. 
• Startup time improvements 
  – Simpler query plans to pass out 
• Runtime performance improvements
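The schema-less model in the comparison above can be read as "schema on read": raw records are stored as they arrive, and whoever writes the query decides which fields to extract and how to convert them. The sketch below assumes JSON lines as the raw format; the sample data and the names `read_with_schema` and `total_by_user` are hypothetical.

```python
import json

# Raw records stored as-is; they do not share one fixed set of fields.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50", "country": "UA"}',
    '{"user": "bob", "amount": "7"}',
    '{"user": "ann", "amount": "3.25", "device": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    schema maps field name -> (converter, default); fields absent from a
    record take the default, and fields not in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else default)
            for field, (convert, default) in schema.items()
        }


def total_by_user(lines):
    """Sum amounts per user; the query author picks the fields and types."""
    schema = {"user": (str, None), "amount": (float, 0.0)}
    totals = {}
    for row in read_with_schema(lines, schema):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(total_by_user(RAW_EVENTS))
```

An RDBMS would reject or reshape these records at load time; here nothing is validated until a query reads the data, which is also why the optimization burden shifts to the person writing the query.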
Hadoop & 
Enterprise architecture 
Classic architecture approach 
Hadoop & Enterprise architecture 
Luxoft Big Data R&D 
Hadoop as ETL Data Quality tool 
BENEFITS 
• Reduced TCO (commodity hardware usage) 
• Traceability of all data quality issues 
• Hadoop becomes the tool that produces clean data 
PROBLEM 
Traditional tools perform poorly at exception handling and data cleansing. 
SOLUTION 
Hadoop transforms the data into a single format and processes it using data cleansing workflows.
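A data cleansing workflow of this kind can be sketched as two steps: normalize each record into one format, then run it through a list of rules and log every failure so that data quality issues stay traceable. The field names, rules, and function names below are hypothetical, and the sketch runs on one machine; on Hadoop the per-record logic would typically sit in a map step.

```python
from datetime import datetime


def normalize(record):
    """Bring a source record into one common format (hypothetical fields)."""
    return {
        "id": str(record.get("id", "")).strip(),
        "email": str(record.get("email", "")).strip().lower(),
        "signup": str(record.get("signup", "")).strip(),
    }


def check_id(rec):
    return None if rec["id"] else "missing id"


def check_email(rec):
    return None if "@" in rec["email"] else "invalid email"


def check_signup(rec):
    try:
        datetime.strptime(rec["signup"], "%Y-%m-%d")
        return None
    except ValueError:
        return "invalid signup date"


# Each rule returns None when the record passes, or a short problem message.
RULES = [check_id, check_email, check_signup]


def cleanse(records):
    """Split records into clean rows and a log of data quality issues."""
    clean, issues = [], []
    for line_no, raw in enumerate(records, start=1):
        rec = normalize(raw)
        problems = [msg for msg in (rule(rec) for rule in RULES) if msg]
        if problems:
            # Keep the original record and its position for traceability.
            issues.append({"line": line_no, "record": raw, "problems": problems})
        else:
            clean.append(rec)
    return clean, issues


if __name__ == "__main__":
    source = [
        {"id": " 1 ", "email": "Ann@Example.com ", "signup": "2014-03-01"},
        {"id": 2, "email": "bob", "signup": "01/03/2014"},
    ]
    clean_rows, issue_log = cleanse(source)
    print(clean_rows)
    print(issue_log)
```

Rejected records are not dropped silently: each one is logged with its position and the rules it failed, which is the traceability benefit listed on the slide.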
Summary 
Big Data: 
• Cutting edge of DI technologies 
• State-of-the-art design approaches 
• More than simple development: it is something of an art, the art of data management 
THANK YOU 
