3. Teradata Hadoop COE
• Who we are
– Teradata's Big Data experts on Hadoop and the Teradata Unified Data Architecture
– Professionals with years of experience in organizational adoption, architecture, design, implementation, and best practices for Big Data
• What we do
– Partner with customers, using our experience and insight to help them make the best possible decisions and solutions for Big Data initiatives within their organization
• What we work with
– Mostly Hadoop: primarily Hortonworks, but other distributions as well
– Related technologies: search, NoSQL, RDBMSs, etc.
4. Use Cases
• Risk Assessment
a) Fast/Hot – social data
b) Slow/Cold – user profiles
• Internet of Things
a) Fast/Hot – sensor data
b) Slow/Cold – events
• Natural Language Processing
a) Fast/Hot – tagging a stream of text
b) Slow/Cold – aggregate views, e.g. tag counts over a larger data set
5. Hadoop & Relational DB Approach
[Architecture diagram: source systems feed both a relational database (stage, ETL with SQL/mapping/transforms, ODS, and modeled layers; hot/warm data; loaded via Fastload) and a Teradata Hadoop appliance (Hive 0.13 on Tez over HCatalog; ODS and modeled layers; all history kept hot/warm/cold as Avro/ORC; a luke-warm layer holding 6+ months of data as ORC). ETL-V (SQL, mapping, transforms) runs on Hadoop, orchestrated by Oozie; Falcon is used for data management, lineage, retention, and replication factors. Batch loads arrive via HDFS put; ad-hoc queries reach both systems through SQL-H/Teradata and query drivers (cli, odbc, jdbc, ...).]
7. Teradata QueryGrid™
[Diagram: Teradata QueryGrid connects the Teradata Database (IDW) and the Teradata Aster Database (Discovery) to Hadoop (remote, push-down processing in Hadoop), other RDBMS databases, Aster functions such as SQL-MapReduce™ and graph, and languages such as SAS, Perl, Python, Ruby, and R.]
When fully implemented, the Teradata Database or the Teradata Aster Database will be able to intelligently use the functionality and data of multiple heterogeneous processing engines.
8. 6/23/2014 Teradata Confidential
TERADATA QUERYGRID: TERADATA-HADOOP
Join Hadoop and Teradata Tables
SELECT e.last_name, e.first_name, d.department_number, d.department_name
FROM Load_From_Hcatalog                          -- in-built TD function
  (USING server('192.168.100.200')               -- Hadoop system
         port('9083')
         username('hive')
         dbname('default')
         tablename('department')                 -- Hadoop table
         columns('*')
         templeton_port('1880')
  ) AS d, Employee e                             -- Teradata table
WHERE e.department_number = d.department_number  -- join condition
ORDER BY 1;
9. LAMBDA ARCHITECTURE - OVERVIEW
BRIEF BACKGROUND
• Reference architecture for Big Data systems
– Emphasis on real-time
– Designed by Nathan Marz (Twitter)
• Big Data system definition
– Defined as a system that runs arbitrary functions
on arbitrary data
• “query = function(all data)”
DESIGN PRINCIPLES
• Human fault-tolerance
– The system must withstand data loss and data corruption, because at scale the damage could be irreparable.
• Data immutability
– Store data in its rawest form, immutable and in perpetuity.
• Computation
– Given the two principles above, it is always possible to (re-)compute results by running a function over the raw data.
BATCH LAYER
> Contains the immutable, constantly growing master dataset.
> Use batch processing to create arbitrary views from this raw dataset.
SERVING LAYER
> Loads and exposes the batch views in a data store so that they can be queried.
– Does not require random writes; must support batch updates and random reads.
SPEED LAYER
> Deals only with new data and compensates for the high-latency updates of the serving layer.
– Leverages stream processing and random read/write data stores to compute real-time views.
> Views remain valid only until the data have found their way through the batch and serving layers.
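At query time the layers combine: a query reads the (complete but stale) batch view from the serving layer and the (fresh but partial) real-time view from the speed layer, and merges them. A minimal sketch of that merge for a per-key count; all names here are illustrative, not from this deck:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "query = function(all data)" resolved across the layers:
// the batch view covers everything up to the last batch run, while the
// real-time view covers only data that arrived since then.
public class LambdaQuery {
    // Merge a per-key count from the serving layer's batch view with
    // the speed layer's real-time view.
    public static long query(String key,
                             Map<String, Long> batchView,
                             Map<String, Long> realtimeView) {
        return batchView.getOrDefault(key, 0L)
             + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("hadoop", 40L);      // computed by the batch layer
        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("hadoop", 2L);    // computed by the speed layer
        System.out.println(query("hadoop", batchView, realtimeView));
    }
}
```

Once a batch run absorbs the new data, the corresponding speed-layer entries are discarded, which is why the views above only need to be additive.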
10. IMPLEMENTATION OF LAMBDA ON TERADATA HADOOP APPLIANCE
11. EXAMPLE TECH STACK
• Data Load – Sqoop / HDFS put
• Batch Layer – Hadoop / ElephantDB
• Speed Layer – Storm
• Serving Layer – HBase / Elasticsearch / ... NoSQL
• Query – issued against the serving and speed layers
13. Storm Topology Setup
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    if (args != null && args.length > 0) {
      // Submit to the cluster under the topology name passed on the command line
      conf.setNumWorkers(3);
      StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    } else {
      // No arguments: run in-process for ten seconds, then shut down
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}
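The topology above wires in a SplitSentence bolt whose source the deck does not show. Its core job is whitespace tokenization; here is a standalone sketch of just that logic, kept free of Storm classes so it compiles on its own (the class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the tokenization a SplitSentence bolt would
// perform before emitting one tuple per word.
public class SplitLogic {
    public static List<String> split(String sentence) {
        // Trim, then break on runs of whitespace.
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // In the bolt, each element would be emitted as a one-field "word" tuple.
        System.out.println(split("the cow jumped over the moon"));
    }
}
```

In the actual bolt, this loop body would sit inside execute(), emitting new Values(word) for each token and declaring a single "word" output field.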
14. Storm Bolt
public static class WordCount extends BaseBasicBolt {
  Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    // Maintain a running count per word and emit the updated count downstream
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null) {
      count = 0;
    }
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}
15. LAMBDA ON TDH: TECHNICAL CONSIDERATIONS
• Zookeeper
– Used by both Storm and HBase, so be aware of this shared dependency
• Search engines (e.g. Elasticsearch, Solr) as serving layers
– Used for free-form queries and other real-time processing capabilities
– Can interact with their own data-visualization tools (e.g. Kibana)
– Can become a primary means of interacting with the data, though not necessarily
• Storm
– All custom code; nothing pre-packaged
• Follow custom coding techniques/approaches
– On TDH, Storm and most serving layers are REQUIRED to run on edge nodes
16. LAMBDA ON TDH: DATA ARCHITECTURE CONSIDERATIONS
• Requires more than just modeling
– Needs equivalent data primitives – higher orders of raw data
– Must conform to question-focused queries
– Practically speaking, needs to account for the complete history and all subsequent changes
• Reconciliation process
– Necessary for providing accurate results across batch and real time
– Must again be question-focused, per user queries
• Query-focused datasets typically require
– Business primitives – the base core query
– Business primitives mapped to the full query
– Business-primitive / data-primitive mash-ups
17. "DYNAMIC DATA ARCHITECTURE"
• Dynamic Data Architecture
– Adaptive to business needs; provides answers to the business
• Cuts across technologies; runs on every technology that serves the architecture
– Powerful, simple model
• Data primitives – metadata, formulas, raw-data snapshots
• Business primitives – metadata, queries, result snapshots
• Primitives are immutable, automated, and fully re-computable at any point in time
• Can also be secured, backed up, audited, and managed independently of existing technology stacks
– Powerful reconciliation process, aka "The Merge"
• Simple, extensible rules engine based on data value and priority
– Qualified, Verified, Version, Priority, and other controls
• Capable of speed and verifiable results via automation
– All of these referenceable via a "Dynamic Data Dictionary"
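The deck does not spell out the rules engine behind "The Merge", but its value-and-priority idea can be pictured with a toy reconciliation step: among candidate values for the same business key (say, one from batch and one from the speed layer), keep the highest-priority candidate, breaking ties on version. Every field name below is an assumption for illustration:

```java
import java.util.Comparator;
import java.util.List;

// Toy sketch of a value/priority reconciliation rule; field names are
// assumptions, not taken from the deck.
public class MergeRules {
    public record Candidate(String value, int priority, int version) {}

    public static Candidate reconcile(List<Candidate> candidates) {
        // Prefer higher priority; among equal priorities, prefer the
        // later version. A real engine would chain further controls
        // (Qualified, Verified, etc.) in the same way.
        return candidates.stream()
            .max(Comparator.comparingInt(Candidate::priority)
                           .thenComparingInt(Candidate::version))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Candidate batch = new Candidate("42.0", 1, 7);  // verified batch result
        Candidate speed = new Candidate("41.9", 0, 9);  // fresher, lower priority
        System.out.println(reconcile(List.of(batch, speed)).value());
    }
}
```

Because the inputs are immutable primitives, the same rule chain can be re-run at any point in time to verify a previously merged result.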
18. LAMBDA ARCHITECTURE WEB REFERENCES
• Implementing Lambda: great little article on Dr. Dobb's
• Lambda Architecture Explained: excellent article on InfoQ
• Lambda Architecture Brief: good brief on Lambda
• Lambda Architecture Overview: excellent view of Lambda
• Practical Lambda Architecture: great presentation on SlideShare
Teradata Hadoop Platform is an integral component of the Teradata Unified Data Architecture™, the only truly unified solution on the market that aligns the best technology to the specific analytic need. It leverages the best-of-breed, complementary values of Teradata, Teradata Aster, and open source Hadoop; all engineered, configured, and delivered ready to run.
Teradata integrates key value-add enabling technologies such as BYNET®, Teradata Viewpoint, Teradata Unity, SQL Assistant, data connectors, and a global support model to provide transparent access, seamless data movement, and a single operational view of your Unified Data Architecture™.
Lambda has been used for a while by very sophisticated shops. As a result, the issue is not just the architecture itself but particular parts of it. The main pain point is what is called the "Merge" or "Reconciliation Process".
Overview
1) Data is consumed from source systems via the edge nodes. NOTE: An ESB is shown because one is sometimes used, but it is not a requirement.
2) Consumption occurs via interactions with the TDH appliance edge nodes, specifically Apache Storm (version 0.9.1 as of this writing on 5/1/2014).
3) Apache Storm is the framework used, with the logic stored within a topology consisting of:
* Elasticsearch spouts/bolts, which process the data and deliver it to Elasticsearch
* Flume spouts/bolts, which process the data and deliver it to Flume for delivery to HDFS
* JMS spouts/bolts, which process the data and deliver it to, and receive it from, the ESB
NOTE: Storm has a heavy reliance on Zookeeper resources.
4) Storm processes the data and delivers it to both the batch layer via Flume-NG and the speed layer via Elasticsearch.
* Elasticsearch indexes and stores the data, then makes it available to the business for real-time query
* Flume-NG deposits the data onto HDFS for processing by the batch layer (Teradata Hadoop)
* HBase is also fed via Storm to provide additional processing of views for the business
5) Once the data has been processed, it can be shared with others. Again, an ESB is shown as an option but is not a requirement.
6) The various consumers of this data can be any number of applications, etc.
7) Ultimately, the various business communities consume the results and continue to interact with the appliance to meet their needs.
Storm also simultaneously processes the data and delivers it to the Flume agent. Flume then writes the data to HDFS: the HDFS request travels to the master node, which maps the appropriate metadata for the request, and the data is finally distributed across HDFS for processing within the Hadoop cluster.