Hybrid architecture integrateduserviewdata-peyman_mohajerian



Big Data Camp LA 2014, Hybrid Architecture for Integrated User View of Data of different Temperature and Velocity by Peyman Mohajerian of Teradata

Published in: Technology, Business
  • Teradata Hadoop Platform is an integral component of the Teradata Unified Data Architecture™, which Teradata describes as the only truly unified solution on the market. The architecture combines the complementary strengths of Teradata, Teradata Aster, and open source Hadoop to align the best technology to the specific analytic need, all engineered, configured, and delivered ready to run.

    Teradata integrates key value-add enabling technologies such as BYNET®, Teradata Viewpoint, Teradata Unity, SQL Assistant, data connectors, and a global support model to provide transparent access, seamless data movement, and a single operational view of your Unified Data Architecture™.
  • carpricedata
  • Lambda has been used for a while by very sophisticated shops. As a result, the issue is not just the architecture itself but particular parts of it. The main pain point is what is called the “Merge” or “Reconciliation Process”.
  • Overview
    1) Data is consumed from source systems via the Edge Nodes. NOTE: An ESB is shown because one is sometimes used, but it is not a requirement.
    2) Consumption occurs through the TDH appliance Edge Nodes, specifically Apache Storm (ver 0.9.1 as of this writing on 5/1/2014).
    3) Apache Storm is the framework used, with the logic stored within a topology consisting of:
    * Elasticsearch Spouts/Bolts, which process the data and deliver it to Elasticsearch
    * Flume Spouts/Bolts, which process the data and deliver it to Flume for delivery to HDFS
    * JMS Spouts/Bolts, which process the data and deliver it to, or receive it from, the ESB
    NOTE: Storm relies heavily on ZooKeeper resources.
    4) Storm processes the data and delivers it to both the batch layer via Flume-NG and the speed layer via Elasticsearch.
    * Elasticsearch indexes and stores the data, then makes it available to the business for real-time queries
    * Flume-NG deposits data onto HDFS for processing by the batch layer (Teradata Hadoop)
    * HBase is also fed via Storm to provide additional processing of views for the business
    5) Once the data has been processed, it can be shared with others. Again, an ESB is shown as an option but is not a requirement.
    6) The consumers of this data can be any number of applications.
    7) Ultimately, various business communities consume the results and continue to interact with the appliance to meet their needs.

    Storm also simultaneously processes the data and delivers it to the Flume agent, which writes the data to HDFS.
    The HDFS request travels to the master node, which maps the appropriate metadata for the request.
    The data is finally distributed across HDFS for processing within the Hadoop cluster.
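
The dual delivery in step 4, where every event goes to both the speed layer and the batch layer, can be sketched in plain Java. This is only an illustration of the fan-out idea and does not use the Storm, Flume, or Elasticsearch APIs. The names `FanOutSketch`, `FanOut`, `addSink`, and `process` are hypothetical, and the two in-memory lists merely stand in for the Elasticsearch index and the Flume-to-HDFS path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for the topology's fan-out; not the Storm API. */
public class FanOutSketch {

    /** Delivers every incoming event to all registered sinks. */
    static final class FanOut {
        private final List<Consumer<String>> sinks = new ArrayList<>();

        FanOut addSink(Consumer<String> sink) {
            sinks.add(sink);
            return this;
        }

        void process(String event) {
            // Each sink sees the same event, so both layers receive identical input.
            for (Consumer<String> sink : sinks) {
                sink.accept(event);
            }
        }
    }

    public static void main(String[] args) {
        List<String> speedLayer = new ArrayList<>(); // assumption: stands in for Elasticsearch
        List<String> batchLayer = new ArrayList<>(); // assumption: stands in for Flume -> HDFS

        FanOut fanOut = new FanOut()
                .addSink(speedLayer::add)
                .addSink(batchLayer::add);

        fanOut.process("event-1");
        fanOut.process("event-2");

        System.out.println("speed layer: " + speedLayer);
        System.out.println("batch layer: " + batchLayer);
    }
}
```

Because both sinks receive the same events, the batch layer can later recompute anything the speed layer served in the meantime, which is what makes a later reconciliation possible.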

  • Hybrid architecture integrateduserviewdata-peyman_mohajerian

    1. Hybrid Architecture for Integrated User View of Data of different Temperature and Velocity. Peyman Mohajerian, Altan Kendup
    2. OVERVIEW • Introduction • Use Cases • Teradata UDA • LAMBDA Architecture • Demo
    3. Teradata Hadoop COE
       • Who we are
         – The experts in Big Data within Teradata on Hadoop and the Teradata Unified Data Architecture
         – Experienced professionals with years of experience in organizational adoption, architecture, design, implementation and best practices of Big Data
       • What we do
         – Partner with customers, using our experience and insights to help make the best possible decisions and solutions for Big Data initiatives within an organization
       • What we work with
         – Hadoop mostly: Hortonworks mostly, but also other distros
         – And other related technologies: Search, NoSQL, RDBMSs, etc.
    4. Use Cases
       • Risk Assessment
         a) Fast/Hot: Social Data
         b) Slow/Cold: User profile
       • Internet of Things
         a) Fast/Hot: Sensor Data
         b) Slow/Cold: Events
       • Natural Language Processing
         a) Fast/Hot: Tagging Stream of Text
         b) Slow/Cold: Aggregate View, e.g. Tag Count over a larger data set
    5. Hadoop & Relational DB Approach (architecture diagram; only its labels survive extraction)
       • Systems: Source; Relational Database; ODS (Hot, Warm); ODS Modeled; Hadoop Appliance (Hive 0.13/Tez)
       • Hadoop data: All Hist /Hive 0.13/Tez, Hot/Warm/Cold (Avro/ORC); Luke-Warm, +6 months of data (ORC); Stage; modeled
       • Processing: ETL (SQL, mapping, xform, ..); ETL-V (SQL, mapping, xform, ..); OOZIE; HCATALOG
       • Data movement: SQL-H/Teradata; Batch Load/HDFS put; Fastload
       • Access: Adhoc Queries; Query Driver (cli, odbc, jdbc…)
       • FALCON FOR DATA MANAGEMENT, LINEAGE, RETENTION, REPLICATION FACTORS
    6. TERADATA UNIFIED DATA ARCHITECTURE (diagram; only its labels survive extraction)
       • USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
       • APPLICATIONS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
       • Integration and management: QUERY GRID, UNITY, SMART LOADER, CONNECTORS, VIEWPOINT, TVI, MDM
       • Platforms: INTEGRATED DATA WAREHOUSE, DISCOVERY PLATFORM, DATA PLATFORM
       • Products: TERADATA DATABASE, TERADATA ASTER DATABASE, TERADATA PORTFOLIO FOR HADOOP
       • SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
    7. Teradata QueryGrid™
       When fully implemented, the Teradata Database or the Teradata Aster Database will be able to intelligently use the functionality and data of multiple heterogeneous processing engines.
       • HADOOP: Remote, push-down processing in Hadoop
       • TERADATA DATABASE (IDW): Teradata Databases
       • TERADATA ASTER DATABASE (Discovery): Aster functions such as SQL-MapReduce™, graph
       • OTHER DATABASES: RDBMS Databases
       • LANGUAGES: Leverage Languages such as SAS, Perl, Python, Ruby, R
    8. TERADATA QUERYGRID: TERADATA-HADOOP (6/23/2014, Teradata Confidential)
       Join Hadoop and Teradata Tables
       SELECT e.last_name, e.first_name, d.department_number, d.department_name
       FROM Load_From_Hcatalog(
              ON empty
              USING server(' ') port('9083') username('hive') dbname('default')
                    tablename('department') columns(*) templeton_port('1880')
            ) AS d, Employee e
       WHERE e.Department_number = d.department_number
       ORDER BY 1
       Slide callouts: Hadoop System (server), In built TD function (Load_From_Hcatalog), Hadoop Table (default, department), Teradata Table (Employee), Join Condition (WHERE clause)
    9. LAMBDA ARCHITECTURE - OVERVIEW
       BRIEF BACKGROUND
       • Reference architecture for Big Data systems
         – Emphasis on real-time
         – Designed by Nathan Marz (Twitter)
       • Big Data system definition
         – Defined as a system that runs arbitrary functions on arbitrary data
         – “query = function(all data)”
       DESIGN PRINCIPLES
       • Human Fault-tolerance
         – The system is not susceptible to data loss or data corruption, because at scale such damage could be irreparable.
       • Data Immutability
         – Store data in its rawest form, immutable and in perpetuity.
       • Computation
         – With the two principles above, it is always possible to (re-)compute results by running a function on the raw data.
       BATCH LAYER
       > Contains the immutable, constantly growing master dataset.
       > Uses batch processing to create arbitrary views from this raw dataset.
       SERVING LAYER
       > Loads and exposes the batch views in a data store so that they can be queried.
         – Does not require random writes; must support batch updates and random reads
       SPEED LAYER
       > Deals only with new data and compensates for the high-latency updates of the serving layer.
         – Leverages stream processing and random read/write data stores to compute real-time views
       > Views remain valid until the data have found their way through the batch and serving layers.
    10. IMPLEMENTATION OF LAMBDA ON TERADATA HADOOP APPLIANCE
    11. EXAMPLE TECH STACK (diagram; only its labels survive extraction). Layers: Query, Batch Layer, Speed Layer, Serving Layer, Data Load. Technologies: Storm, Sqoop/HDFS put, HBase/Elasticsearch …… NoSQL, Hadoop/ElephantDB
    12. Apache Storm
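    13. Storm Topology Setup
        // Imports assume Storm 0.9.x, where the core classes live under backtype.storm.
        import backtype.storm.Config;
        import backtype.storm.LocalCluster;
        import backtype.storm.StormSubmitter;
        import backtype.storm.topology.TopologyBuilder;
        import backtype.storm.tuple.Fields;

        // Enclosing class name is assumed; the slide shows only its closing brace.
        // RandomSentenceSpout, SplitSentence and WordCount are defined elsewhere.
        public class WordCountTopology {
            public static void main(String[] args) throws Exception {
                // Wire the topology: spout -> split (shuffle) -> count (grouped by word).
                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("spout", new RandomSentenceSpout(), 5);
                builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
                builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

                Config conf = new Config();
                conf.setDebug(true);

                if (args != null && args.length > 0) {
                    // A topology name was supplied: submit to a real cluster.
                    conf.setNumWorkers(3);
                    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
                } else {
                    // No arguments: run in-process for ten seconds, then shut down.
                    conf.setMaxTaskParallelism(3);
                    LocalCluster cluster = new LocalCluster();
                    cluster.submitTopology("word-count", conf, builder.createTopology());
                    Thread.sleep(10000);
                    cluster.shutdown();
                }
            }
        }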
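    14. Storm Bolt
        // These imports belong at the top of the enclosing file; they assume Storm 0.9.x.
        import java.util.HashMap;
        import java.util.Map;
        import backtype.storm.topology.BasicOutputCollector;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseBasicBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        public static class WordCount extends BaseBasicBolt {
            // Running count per word, held in memory by each bolt task.
            Map<String, Integer> counts = new HashMap<String, Integer>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Integer count = counts.get(word);
                if (count == null) count = 0;
                count++;
                counts.put(word, count);
                // Emit the updated running total for this word.
                collector.emit(new Values(word, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }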
    15. LAMBDA ON TDH TECHNICAL CONSIDERATIONS
        • ZooKeeper
          – Used by both Storm and HBase, so be aware of this dependency
        • Search engines (e.g. Elasticsearch/Solr) as serving layers
          – Used for free-form queries and other real-time processing capabilities
          – Can interact with their own data visualizations (e.g. Kibana)
          – Can become a primary means of data interaction, though not necessarily
        • Storm
          – All custom code; nothing pre-packaged
          – Follow custom coding techniques/approaches
        • For TDH, Storm and most serving layers are *REQUIRED* to be on Edge Nodes
    16. LAMBDA ON TDH DATA ARCHITECTURE CONSIDERATIONS
        • Requires more than just modeling
          – Needs equivalent data primitives: higher orders of raw data
          – Must conform to question-focused queries
          – Practically speaking, need to consider complete history and all subsequent changes
        • Reconciliation process
          – Necessary for providing accurate results via batch and real-time
          – Must again be question-focused per user queries
        • Query-focused datasets typically require
          – Business primitives: base core query
          – Business primitives mapping to full query
          – Business primitives/data primitives mash-ups
    17. “DYNAMIC DATA ARCHITECTURE”
        • Dynamic Data Architecture
          – Adaptive to business needs; provides answers to business
            • Cuts across technologies, runs on all technologies that serve the architecture
          – Powerful, simple model
            • Data primitives: metadata, formulas, raw data snapshot
            • Business primitives: metadata, queries, results snapshots
            • Primitives immutable, automated and fully re-computable at any point in time
            • Also can be secured, backed-up, audited, and managed independent of existing technology stacks
          – Powerful reconciliation process aka “The Merge”
            • Simple/extensible rules engine based on data value and priority: Qualified, Verified, Version, Priority, and other controls
            • Capable of speed and verifiable results via automation
          – All of these referenceable via “Dynamic Data Dictionary”
    18. LAMBDA ARCHITECTURE WEB REFERENCES
        • Implementing Lambda: Great little article on Dr. Dobbs
        • Lambda Architecture Explained: Excellent article on InfoQ
        • Lambda Architectural Brief: Good brief on Lambda
        • Lambda Architectural Overview: Excellent view of Lambda
        • Practical Lambda Architecture: Great presentation on Slideshare
    19. DEMO (diagram; only its labels survive extraction): Flume (Source/logs, Channel, ES Sink), Elasticsearch, Kibana GUI
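
The Lambda overview slide states that “query = function(all data)”, and that the speed layer covers only data the batch layer has not yet absorbed. A minimal Java sketch of that idea follows, using the tag-count use case from slide 4. It assumes the simplest possible merge, adding batch counts and real-time counts per tag. The rules-based reconciliation described on the “Dynamic Data Architecture” slide (Qualified, Verified, Version, Priority) is not shown in the deck and is not reproduced here. The names `LambdaMergeSketch`, `countTags`, and `merge` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a batch view merged with a real-time view. */
public class LambdaMergeSketch {

    /** Recomputes a tag-count view from scratch over a list of raw, immutable records. */
    static Map<String, Integer> countTags(List<String> tags) {
        Map<String, Integer> view = new HashMap<>();
        for (String tag : tags) {
            view.merge(tag, 1, Integer::sum);
        }
        return view;
    }

    /** Query-time merge (assumed additive): batch counts plus counts for newer data. */
    static Map<String, Integer> merge(Map<String, Integer> batchView,
                                      Map<String, Integer> realtimeView) {
        Map<String, Integer> merged = new HashMap<>(batchView);
        realtimeView.forEach((tag, n) -> merged.merge(tag, n, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Master dataset already processed by the last batch run.
        List<String> master = Arrays.asList("hadoop", "storm", "hadoop");
        // Events that arrived after the batch run started.
        List<String> recent = Arrays.asList("storm", "kibana");

        Map<String, Integer> batchView = countTags(master);
        Map<String, Integer> realtimeView = countTags(recent);

        System.out.println("merged view: " + merge(batchView, realtimeView));
    }
}
```

Once the next batch run has absorbed the recent events, the real-time view for that window can be discarded. Running `countTags` over the enlarged master dataset then yields the same totals, which is the re-computability property the design principles rely on.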