www.luxoft.com 
DWH & Big Data 
Odessa 
Vladimir Slobodianiuk 
Date: 2014
Agenda
1 Big Data – what is it
2 Hadoop vs RDBMS – pros and cons
3 Hadoop & Enterprise architecture
4 Hadoop as ETL engine
5 Case Studies
Big Data – what is it
Current state 
• Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
Limitations & Problems 
• Big data is difficult to work with using most relational databases, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers
• eBay.com uses two data warehouses at 7.5 petabytes
• Walmart handles more than 1 million customer transactions every hour
• Facebook handles 50 billion photos from its user base
• In 2012, the Obama administration announced the Big Data Research and Development Initiative
Hadoop vs RDBMS
CORE HADOOP – MapReduce
In 2004, Google published a paper on a process called MapReduce.
• DISTRIBUTED COMPUTING FRAMEWORK
• Process large jobs in parallel across many nodes and combine the results
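The idea of splitting a job into parallel map tasks and combining their results can be sketched in a few lines of plain Python. This is only an in-memory illustration of the programming model, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines.

    On a real cluster each line (or block of lines) would be mapped on a
    different node; here everything runs in one process for clarity.
    """
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can in principle be spread across many nodes.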
Hadoop Structure 
• HDFS is a distributed file system designed to run on commodity hardware
• HBase stores data rows in labelled tables (a sortable key and an arbitrary number of columns)
• Hive provides data summarization, query, and analysis (SQL-like interface)
• Pig is a platform for analyzing large data sets that consists of a high-level language
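The HBase row model mentioned above (a sortable key plus an arbitrary number of columns per row) can be pictured with a toy in-memory table. This is a sketch of the data model only, assuming string keys; `ToyTable` and its methods are hypothetical names and do not correspond to the HBase client API.

```python
class ToyTable:
    """Toy model of a table whose rows have a sortable key and free-form columns."""

    def __init__(self):
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Set one column of one row; rows need not share the same columns."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Return a copy of all columns stored for a row (empty if absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (key, columns) in key order for start_key <= key < stop_key."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])
```

A short usage example: after `put("user#002", "city", "Odessa")` and `put("user#001", "name", "Ann")`, a `scan("user#", "user#~")` returns the row for `user#001` before `user#002`, because rows are visited in key order.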
Hadoop vs RDBMS
RDBMS
• Performance for relational data
• Machine query optimization
• Mature workload management
• High concurrency interactive query processing
How might this change in the future
• Query Optimization Improvements in Hive
– Statistics, better join ordering, more join types, etc.
• Startup Time Improvements
– Simpler query plans to pass out
• Runtime Performance Improvements
Hadoop
• Schema-less Model
• Human query optimization
• Ability to create complex dataflow with multiple inputs and outputs
• Parallelize many Analytic Functions
Hadoop & Enterprise architecture
Classic architecture approach
Hadoop & Enterprise architecture
Case Study 1 
Hadoop as ETL Data Quality tool 
BENEFITS
• Reduced TCO (commodity hardware usage)
• Traceability of all the data quality issues
• Hadoop becomes a clean-data tool
PROBLEM
Traditional tools show poor performance in exception and data cleansing.
SOLUTION
Hadoop transforms the data into a single format and processes it using data cleansing workflows.
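A cleansing step of this kind can be sketched as a function that converts records to one format and logs every rejected record, so that each data quality issue stays traceable. The record fields (`name`, `date`), the accepted date formats, and the names `normalize_date` and `cleanse` are assumptions made for illustration; the slides do not describe the actual workflows.

```python
from datetime import datetime

# Assumed input date formats; a real feed would define its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_date(raw):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(records):
    """Split records into cleaned rows and a traceable list of issues.

    Each issue is a (record number, reason) pair, so every rejected
    record can be traced back to its position in the input.
    """
    clean, issues = [], []
    for record_no, record in enumerate(records, start=1):
        name = record.get("name", "").strip()
        date = normalize_date(record.get("date", ""))
        if not name:
            issues.append((record_no, "missing name"))
        elif date is None:
            issues.append((record_no, "unparseable date"))
        else:
            clean.append({"name": name.title(), "date": date})
    return clean, issues
```

In a Hadoop job the body of the loop would typically run inside map tasks, with the clean rows and the issue log written to separate outputs.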
Case Study 2 
Know Your Customer PoC 
Business Challenge
• Knowing the actual customer reaction to products is essential for business growth, but it's difficult to get valuable insights. Social media is the place where customers really share their opinions.
SOLUTION
Hadoop-based analysis tool that provides the ability to:
• Find the events in the client streams and identify the needed reaction
• Propose a product to a client based on their interests
Case Study 3 
Enterprise ETL & Hadoop Integration 
Goals:
• MapReduce ETL jobs development without coding
• Build, re-use, and check impact analysis with enhanced metadata capabilities
• A windows-based graphical development environment
• Comprehensive built-in transformations
• A library of Use Case Accelerators to fast-track Hadoop productivity
Summary
Big Data:
• Cutting edge of DI technologies
• State-of-the-art design approaches
• A bit more than simple development: it is something of an art, the art of data management
THANK YOU

FOSS Sea 2014: DataWarehouse & BigData, Vladimir Slobodianiuk (Luxoft)
