3. Teradata Hadoop COE
• Who we are
– Teradata's Big Data experts on Hadoop and the Teradata Unified Data Architecture
– Professionals with years of experience in organizational adoption, architecture, design, implementation, and best practices for Big Data
• What we do
– Partner with customers, using our experience and insight to help them make the best possible decisions and solutions for Big Data initiatives within their organization
• What we work with
– Mostly Hadoop: primarily Hortonworks, but other distributions as well
– Related technologies: search, NoSQL, RDBMSs, etc.
4. Use Cases
• Risk Assessment
a) Fast/Hot – social data
b) Slow/Cold – user profiles
• Internet of Things
a) Fast/Hot – sensor data
b) Slow/Cold – events
• Natural Language Processing
a) Fast/Hot – tagging a stream of text
b) Slow/Cold – aggregate views, e.g. tag counts over a larger data set
5. Hadoop & Relational DB Approach
[Architecture diagram: source systems feed both a relational database (stage, ETL with SQL/mapping/transforms, ODS, and modeled layers; hot/warm data; loaded via Fastload) and a Teradata Hadoop appliance (Hive 0.13 on Tez over HCatalog; ODS and modeled layers; all history kept hot/warm/cold as Avro/ORC; a luke-warm layer holding 6+ months of data as ORC). ETL-V (SQL, mapping, transforms) runs on Hadoop, orchestrated by Oozie; Falcon is used for data management, lineage, retention, and replication factors. Batch loads arrive via HDFS put; ad-hoc queries reach both systems through SQL-H/Teradata and query drivers (cli, odbc, jdbc, ...).]
7. Teradata QueryGrid™
[Diagram: Teradata QueryGrid connects the Teradata Database (IDW) and the Teradata Aster Database (Discovery) to Hadoop (remote, push-down processing in Hadoop), other RDBMS databases, Aster functions such as SQL-MapReduce™ and graph, and languages such as SAS, Perl, Python, Ruby, and R.]
When fully implemented, the Teradata Database or the Teradata Aster Database will be able to intelligently use the functionality and data of multiple heterogeneous processing engines.
8. 6/23/2014 Teradata Confidential
TERADATA QUERYGRID: TERADATA-HADOOP
Join Hadoop and Teradata Tables
SELECT e.last_name, e.first_name, d.department_number, d.department_name
FROM Load_From_Hcatalog                          -- in-built TD function
  (USING server('192.168.100.200')               -- Hadoop system
         port('9083')
         username('hive')
         dbname('default')
         tablename('department')                 -- Hadoop table
         columns('*')
         templeton_port('1880')
  ) AS d, Employee e                             -- Teradata table
WHERE e.department_number = d.department_number  -- join condition
ORDER BY 1;
9. LAMBDA ARCHITECTURE - OVERVIEW
BRIEF BACKGROUND
• Reference architecture for Big Data systems
– Emphasis on real-time
– Designed by Nathan Marz (Twitter)
• Big Data system definition
– Defined as a system that runs arbitrary functions
on arbitrary data
• “query = function(all data)”
DESIGN PRINCIPLES
• Human fault-tolerance
– The system must withstand data loss and data corruption, because at scale the damage could be irreparable.
• Data immutability
– Store data in its rawest form, immutable and in perpetuity.
• Computation
– Given the two principles above, it is always possible to (re-)compute results by running a function over the raw data.
BATCH LAYER
> Contains the immutable, constantly growing master dataset.
> Use batch processing to create arbitrary views from this raw dataset.
SERVING LAYER
> Loads and exposes the batch views in a data store so that they can be queried.
– Does not require random writes; must support batch updates and random reads.
SPEED LAYER
> Deals only with new data and compensates for the high-latency updates of the serving layer.
– Leverages stream processing and random read/write data stores to compute real-time views.
> Views remain valid only until the data have found their way through the batch and serving layers.
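At query time the layers combine: a query reads the (complete but stale) batch view from the serving layer and the (fresh but partial) real-time view from the speed layer, and merges them. A minimal sketch of that merge for a per-key count; all names here are illustrative, not from this deck:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "query = function(all data)" resolved across the layers:
// the batch view covers everything up to the last batch run, while the
// real-time view covers only data that arrived since then.
public class LambdaQuery {
    // Merge a per-key count from the serving layer's batch view with
    // the speed layer's real-time view.
    public static long query(String key,
                             Map<String, Long> batchView,
                             Map<String, Long> realtimeView) {
        return batchView.getOrDefault(key, 0L)
             + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("hadoop", 40L);      // computed by the batch layer
        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("hadoop", 2L);    // computed by the speed layer
        System.out.println(query("hadoop", batchView, realtimeView));
    }
}
```

Once a batch run absorbs the new data, the corresponding speed-layer entries are discarded, which is why the views above only need to be additive.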
10. IMPLEMENTATION OF LAMBDA ON TERADATA HADOOP APPLIANCE
11. EXAMPLE TECH STACK
• Data Load – Sqoop / HDFS put
• Batch Layer – Hadoop / ElephantDB
• Speed Layer – Storm
• Serving Layer – HBase / Elasticsearch / ... NoSQL
• Query – issued against the serving and speed layers
13. Storm Topology Setup
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    if (args != null && args.length > 0) {
      // Submit to the cluster under the topology name passed on the command line
      conf.setNumWorkers(3);
      StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    } else {
      // No arguments: run in-process for ten seconds, then shut down
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}
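The topology above wires in a SplitSentence bolt whose source the deck does not show. Its core job is whitespace tokenization; here is a standalone sketch of just that logic, kept free of Storm classes so it compiles on its own (the class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the tokenization a SplitSentence bolt would
// perform before emitting one tuple per word.
public class SplitLogic {
    public static List<String> split(String sentence) {
        // Trim, then break on runs of whitespace.
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // In the bolt, each element would be emitted as a one-field "word" tuple.
        System.out.println(split("the cow jumped over the moon"));
    }
}
```

In the actual bolt, this loop body would sit inside execute(), emitting new Values(word) for each token and declaring a single "word" output field.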
14. Storm Bolt
public static class WordCount extends BaseBasicBolt {
  Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    // Maintain a running count per word and emit the updated count downstream
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null) {
      count = 0;
    }
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}
15. LAMBDA ON TDH: TECHNICAL CONSIDERATIONS
• Zookeeper
– Used by both Storm and HBase, so be aware of this shared dependency
• Search engines (e.g. Elasticsearch, Solr) as serving layers
– Used for free-form queries and other real-time processing capabilities
– Can interact with their own data-visualization tools (e.g. Kibana)
– Can become a primary means of interacting with the data, though not necessarily
• Storm
– All custom code; nothing pre-packaged
• Follow custom coding techniques/approaches
– On TDH, Storm and most serving layers are REQUIRED to run on edge nodes
16. LAMBDA ON TDH: DATA ARCHITECTURE CONSIDERATIONS
• Requires more than just modeling
– Needs equivalent data primitives – higher orders of raw data
– Must conform to question-focused queries
– Practically speaking, needs to account for the complete history and all subsequent changes
• Reconciliation process
– Necessary for providing accurate results across batch and real time
– Must again be question-focused, per user queries
• Query-focused datasets typically require
– Business primitives – the base core query
– Business primitives mapped to the full query
– Business-primitive / data-primitive mash-ups
17. "DYNAMIC DATA ARCHITECTURE"
• Dynamic Data Architecture
– Adaptive to business needs; provides answers to the business
• Cuts across technologies; runs on every technology that serves the architecture
– Powerful, simple model
• Data primitives – metadata, formulas, raw-data snapshots
• Business primitives – metadata, queries, result snapshots
• Primitives are immutable, automated, and fully re-computable at any point in time
• Can also be secured, backed up, audited, and managed independently of existing technology stacks
– Powerful reconciliation process, aka "The Merge"
• Simple, extensible rules engine based on data value and priority
– Qualified, Verified, Version, Priority, and other controls
• Capable of speed and verifiable results via automation
– All of these referenceable via a "Dynamic Data Dictionary"
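The deck does not spell out the rules engine behind "The Merge", but its value-and-priority idea can be pictured with a toy reconciliation step: among candidate values for the same business key (say, one from batch and one from the speed layer), keep the highest-priority candidate, breaking ties on version. Every field name below is an assumption for illustration:

```java
import java.util.Comparator;
import java.util.List;

// Toy sketch of a value/priority reconciliation rule; field names are
// assumptions, not taken from the deck.
public class MergeRules {
    public record Candidate(String value, int priority, int version) {}

    public static Candidate reconcile(List<Candidate> candidates) {
        // Prefer higher priority; among equal priorities, prefer the
        // later version. A real engine would chain further controls
        // (Qualified, Verified, etc.) in the same way.
        return candidates.stream()
            .max(Comparator.comparingInt(Candidate::priority)
                           .thenComparingInt(Candidate::version))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Candidate batch = new Candidate("42.0", 1, 7);  // verified batch result
        Candidate speed = new Candidate("41.9", 0, 9);  // fresher, lower priority
        System.out.println(reconcile(List.of(batch, speed)).value());
    }
}
```

Because the inputs are immutable primitives, the same rule chain can be re-run at any point in time to verify a previously merged result.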
18. LAMBDA ARCHITECTURE WEB REFERENCES
• Implementing Lambda: great little article on Dr. Dobb's
• Lambda Architecture Explained: excellent article on InfoQ
• Lambda Architecture Brief: good brief on Lambda
• Lambda Architecture Overview: excellent view of Lambda
• Practical Lambda Architecture: great presentation on SlideShare
Teradata Hadoop Platform is an integral component of the Teradata Unified Data Architecture™, the only truly unified solution on the market that aligns the best technology to the specific analytic need. It leverages the best-of-breed, complementary values of Teradata, Teradata Aster, and open source Hadoop; all engineered, configured, and delivered ready to run.
Teradata integrates key value-add enabling technologies such as BYNET®, Teradata Viewpoint, Teradata Unity, SQL Assistant, data connectors, and a global support model to provide transparent access, seamless data movement, and a single operational view of your Unified Data Architecture™.
Lambda has been used for a while by very sophisticated shops. As a result, the issue is not just the architecture itself but particular parts of it. The main pain point is what is called the "Merge" or "Reconciliation Process".
Overview
1) Data is consumed from source systems via the edge nodes. NOTE: An ESB is shown because one is sometimes used, but it is not a requirement.
2) Consumption occurs via interactions with the TDH appliance edge nodes, specifically Apache Storm (version 0.9.1 as of this writing on 5/1/2014).
3) Apache Storm is the framework used, with the logic stored within a topology consisting of:
* Elasticsearch spouts/bolts, which process the data and deliver it to Elasticsearch
* Flume spouts/bolts, which process the data and deliver it to Flume for delivery to HDFS
* JMS spouts/bolts, which process the data and deliver it to, and receive it from, the ESB
NOTE: Storm has a heavy reliance on Zookeeper resources.
4) Storm processes the data and delivers it to both the batch layer via Flume-NG and the speed layer via Elasticsearch.
* Elasticsearch indexes and stores the data, then makes it available to the business for real-time query
* Flume-NG deposits the data onto HDFS for processing by the batch layer (Teradata Hadoop)
* HBase is also fed via Storm to provide additional processing of views for the business
5) Once the data has been processed, it can be shared with others. Again, an ESB is shown as an option but is not a requirement.
6) The various consumers of this data can be any number of applications, etc.
7) Ultimately, the various business communities consume the results and continue to interact with the appliance to meet their needs.
Storm also simultaneously processes the data and delivers it to the Flume agent. Flume then writes the data to HDFS: the HDFS request travels to the master node, which maps the appropriate metadata for the request, and the data is finally distributed across HDFS for processing within the Hadoop cluster.