Artur Borycki - Beyond Lambda - how to get from logical to physical - code.talk.2015


  1. Beyond Lambda - how to get from logical to physical. Artur Borycki, Director Technology & Innovations
  2. Simplification & Efficiency. Teradata believes in the principles of self-service, automation and on-demand resource allocation. These enable faster, more efficient and more effective development and operation of data applications.
  3. What is the Lambda Architecture?
Background
• Reference architecture for Big Data systems, designed by Nathan Marz (Twitter)
• Defined as a system that runs arbitrary functions on arbitrary data: "query = function(all data)"
Design principles
• Human fault tolerance, immutability, computability
Lambda layers
• Batch - contains the immutable, constantly growing master dataset
• Speed - deals only with new data and compensates for the high-latency updates of the serving layer
• Serving - loads and exposes the combined view of the data so that it can be queried
  4. Lambda framework
• An immutable sequence of records is captured and fed into a batch system and a stream-processing system in parallel.
• You implement your transformation logic twice: once in the batch system and once in the stream-processing system.
• You stitch together the results from both systems at query time to produce a complete answer, as sketched below.
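A minimal sketch of that query-time stitching, assuming a batch view recomputed from the master dataset and a speed view holding only records that arrived since the last batch run; all names and numbers are illustrative, not a Teradata API.

```python
# Minimal sketch of Lambda query-time merging (illustrative names and values).
# The batch view is recomputed from the immutable master dataset; the speed view
# covers only events that arrived after the last batch run.

def batch_view(key):
    # In a real system this would read a precomputed view from the serving layer.
    return {"page_views": 10_000}          # placeholder batch result

def speed_view(key):
    # Incremental counts maintained by the stream processor since the last batch run.
    return {"page_views": 42}              # placeholder real-time delta

def query(key):
    """query = function(all data): stitch batch and speed results at query time."""
    batch, speed = batch_view(key), speed_view(key)
    return {metric: batch.get(metric, 0) + speed.get(metric, 0)
            for metric in set(batch) | set(speed)}

print(query("customer-123"))               # {'page_views': 10042}
```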
  5. Overall Architecture – Customer example
  6. Lambda alternative – Kappa? (Jay Kreps, LinkedIn)
Unlike the Lambda Architecture, in this approach you only reprocess when your processing code changes (see the sketch after this list):
1. Use Kafka or another system that lets you retain the full log of the data you want to be able to reprocess and that allows multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your Kafka retention to 30 days.
2. When you want to reprocess, start a second instance of your stream-processing job that starts from the beginning of the retained data, but direct its output to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job and delete the old output table.
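A sketch of step 2 using the kafka-python client (assumed available); the topic, consumer-group naming, table names and the transform/sink helpers are assumptions for illustration, not part of the slide.

```python
# Sketch of Kappa-style reprocessing with kafka-python (pip install kafka-python).
# Topic, group id and table names are illustrative.
from kafka import KafkaConsumer

def reprocess(version):
    # A fresh consumer group starting at the earliest retained offset replays the
    # whole log (e.g. 30 days of data if that is the topic's retention).
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="kafka:9092",
        group_id=f"clickstream-reprocessor-v{version}",
        auto_offset_reset="earliest",
        enable_auto_commit=True,
    )
    output_table = f"clickstream_sessions_v{version}"   # new output table; the old one stays untouched
    for message in consumer:
        record = transform(message.value)                # the (changed) processing logic
        write_to_table(output_table, record)             # once caught up, switch readers to the new
                                                         # table, then stop the old job and drop its table

def transform(raw_bytes):
    return raw_bytes      # placeholder for the real transformation

def write_to_table(table, record):
    pass                  # placeholder for the sink (Teradata, HBase, ...)
```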
  7. Real-time maturity - typical path for a customer
Customers typically go through four stages on their path to real-time analysis. The evolution usually starts with trying to visualize results or reports more frequently, which leads to the realization that the underlying data is not refreshed frequently enough. The next stage of maturity is to capture and ingest data more quickly. Once data is flowing faster, customers then try to process the data as it flows. The final stage is to remove any human intervention.
  8. Beyond Lambda – Omega ;) (Artur's vision)
[Diagram: events/interactions from all data streams and other feeds flowing to consumers of information via discovery, advanced analytics, data binding and reporting]
• We need events that require actions and interactions without much analytics.
• We need events that require action but also need to be enhanced by analytics in the ecosystem, based on other information sources (a routing sketch follows this list).
• We need events that will be handled later or that support the cases above.
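A minimal sketch of that three-way routing. Only the three event classes come from the slide; the classification flags, handlers and sample event are invented for illustration.

```python
# Illustrative routing of events into the three classes described above.

def route(event):
    if event.get("requires_action") and not event.get("needs_context"):
        return act(event)                      # act on the event directly, no analytics needed
    if event.get("requires_action"):
        enriched = enrich(event)               # enhance with analytics from other sources first
        return act(enriched)
    return park(event)                         # handled later / supporting the cases above

def enrich(event):
    return {**event, "segment": "high-value"}  # placeholder analytical enrichment

def act(event):
    print("acting on", event)

def park(event):
    print("stored for later", event)

route({"type": "fraud-alert", "requires_action": True, "needs_context": True})
```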
  9. The Teradata UDA (Unified Data Architecture)
• Integrated Data Warehouse - Teradata Database
• Integrated Discovery Platform - Teradata Aster Database
• Data Platform - Teradata Portfolio for Hadoop
• Real-time processing, RESTful APIs, Listening Framework and App Framework on top
• Security and workload management across the applications
  10. Key services, libraries & templates
• Data Service APIs - access data on Teradata, Aster and Hadoop via API calls
• Logging - push and store events about your app to UDA logging services
• Ingest / Streaming - stream data into the UDA and build applications on near real-time data
• Scheduling / Orchestration - scheduling services let developers build workflows and connect apps
• Search & Metadata - expose search capabilities in your app via UDA-level search services
• WebKit - a toolbox of UI templates, visualizations and JavaScript libraries
• Package / Deploy & Publish - a simple package-and-deployment application to launch your app in the AppCenter ecosystem
• Operate - leverage monitoring and alerting services to keep track of app health
The UDA is a concept, but it also serves as a development platform.
  11. Decoupled services - instead of a single monolithic database
• Monolith: a monolithic application puts all of its functionality into a single process and scales by replicating the monolith on multiple servers.
• Microservices: a microservices architecture puts each element of functionality into a separate service and scales by distributing these services across servers.
  12. Think microservices, not monolithic - scale by distributing services and replicating as needed
• Monolithic app: a monolithic application puts all of its functionality into a single process and scales by replicating the monolith on multiple servers.
• Microservices: a microservices architecture puts each element of functionality into a separate service and scales by distributing these services across servers.
  13. Teradata Data Services - access and move data between systems through service APIs
[Diagram: an application makes a REST API call to the UDA data services layer, which sends the query to Teradata, executes it, and sends the response back; a sketch of this round trip follows below]
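The call / send query / execute query / send response loop can be pictured as a single REST round trip. The endpoint, payload shape and response format below are hypothetical, not a documented Teradata data-services API.

```python
# Hypothetical REST round trip against a UDA data service: the client sends a SQL
# query, the service executes it on Teradata and returns the rows as JSON.
import json
import urllib.request

payload = {
    "system": "teradata-prod",   # hypothetical target system name
    "query": "SELECT cust_id, cust_zip FROM cust_data WHERE last_update = CURRENT_DATE",
}

request = urllib.request.Request(
    "https://dataservices.example.com/api/v1/queries",   # hypothetical data-services endpoint
    data=json.dumps(payload).encode("utf-8"),             # "Send Query"
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:          # service executes the query
    rows = json.load(response)                              # "Send Response"
    print(rows)
```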
  14. QueryGrid – data movement and remote execution. Move data between Teradata and Hadoop and/or other data stores.
Foreign table select – pass-through (pushes foreign grammar to the remote system; on Hadoop this permits a Hive/Impala query for data reduction on non-partitioned columns):
SELECT * FROM FOREIGN TABLE (
    select parse_url(refer, 'HOST') as host, v.key as key, ts as session_ts, v.val, count(*) as count
    from http_inline
    LATERAL VIEW explode(str_to_map(parse_url(refer, 'QUERY'), '&', '=')) v as key, val
    where parse_url(refer, 'QUERY') is not null
    group by parse_url(refer, 'HOST'), v.key, v.val
)@hdp21 hdp_dpi
WHERE session_ts = current_date;
Import – can be used with "insert/select" and "create table as" to instantiate data locally; joins with local tables are always possible:
SELECT source, session FROM clickstream@Hadoop_sysd WHERE session_ts = '2013-01-01';
Export:
INSERT INTO cust_loc@Hadoop_sysd SELECT cust_id, cust_zip FROM cust_data WHERE last_update = current_date;
  15. The Data Lake - customer slide
• This is not skating to where the puck is going to be - it's skating to the puck. Your CIO should be sitting you on the bench if you are not doing this already.
Most data lakes today - passive cheap storage:
• Really only using HDFS
• Limited data governance
• Staging data
• Archiving data
• DW offload (cost drivers)
The data lakes we should be building - active balanced nodes:
• Using the full Hadoop stack and more
• Good data governance
• Good information architecture
• Processing and enhancing data
• Data applications (flexibility drivers)
  16. New architecture
• Information architectures are distributed - focus on data and business questions, not on integrating separate systems
• Application architectures are variable - don't force applications into a single architecture
• Applications are loosely coupled - the DW is an application, BI is an application (or many), and data applications are everywhere
• But let's be smart about it - we still need strong information architecture and data management practices, and we still need to reduce complexity and make strategic choices on technology
  17. Loosely coupled data applications
• Feature Store application - generating analytical features (HDFS, Spark, Elasticsearch)
• Campaign application - targeting customers (Teradata)
• Customer Registry application - registry model of all customers (Teradata)
• BI application - general query and access to features (Teradata SQL, MicroStrategy, Tableau)
• Customer Matching application - associating customers (Spark, Python, Scala, R)
  18. Customer - framework overview
[Diagram: external systems deliver files over SFTP through a B2B gateway into the customer gate; edge Flume jobs (SLA, non-SLA and DR) land the data into the corresponding raw-processing HDFS clusters; data moves on to access-layer Hadoop and into the SLA, non-SLA and DR Teradata systems, with DistCP and QueryGrid handling movement and backup/restore across the internal DAP platform]
  19. Customer - microservices example
  20. Move data between systems and access it through service APIs
[Diagram: apps sit on a service layer spanning the Listener and a data pipeline; QueryGrid connects the underlying infrastructure - Teradata 6xxx, Teradata 1xxx, Aster and two Hadoop clusters]
  21. Customer example - integration flow
• A user starts a workflow from the UI that has a single Pig job.
• Azkaban Web requests that the Azkaban Executor start a new Pig job.
• The Pig job makes a REST call to the Template Module to render the Pig template.
• The Template Module fetches config values from the Config Module if the template needs them; the Config Module in turn fetches config values either from the PCF data schema or from external systems.
• The Template Module renders the Pig template and returns a complete Pig script.
• The Pig job executes the Pig script against the Hadoop cluster.
• During execution the Pig job makes REST calls to the Event Module to report its progress.
• As the job progress is updated, Vert.x updates the Azkaban Web UI in real time.
• When the Pig job has completed, it makes a REST call to the Audit Module to log its completion; the Audit Module in turn stores auditing information in the PCF data schema.
• Finally, the Pig job returns its execution status back to the Azkaban Executor.
(Components: MySQL, Azkaban Web, Azkaban Executor, Kafka, Azkaban Bridge Service, Config Service, Template Service, Teradata Service, Event Service, Audit Service, the Pig job, Hadoop and PCF; communication is REST with JSON payloads.)
A sketch of the job-side steps follows below.
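A sketch of the job-side half of this flow. The service URLs, payload shapes and the way the Pig script is invoked are assumptions for illustration; they are not the customer's actual modules.

```python
# Sketch of the Pig-job side of the integration flow: render the template over REST,
# run the script, and report progress/completion. All URLs and payloads are hypothetical.
import json
import subprocess
import urllib.request

def post_json(url, payload):
    req = urllib.request.Request(url, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        return json.loads(body) if body else None

def run_workflow(job_id):
    # 1. Ask the Template Module to render the Pig template (it pulls config itself).
    rendered = post_json("http://template-service/render", {"template": "daily_load.pig"})
    script = rendered["script"]

    # 2. Execute the rendered Pig script against the Hadoop cluster.
    post_json("http://event-service/events", {"job": job_id, "status": "STARTED"})
    result = subprocess.run(["pig", "-e", script], capture_output=True)

    # 3. Report completion to the Event and Audit modules, then return the status.
    status = "SUCCEEDED" if result.returncode == 0 else "FAILED"
    post_json("http://event-service/events", {"job": job_id, "status": status})
    post_json("http://audit-service/audits", {"job": job_id, "status": status})
    return status
```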
  22. Customer - event flow
[Diagram: a Kafka event bus connects the Flume, HDFS, HBase and HCatalog event modules, the Teradata service, Azkaban Web and the Azkaban Executor; events such as flume.filelanded, hdfs.filelanded, hbase.tablemodified and hcatalog.newpartition flow as JSON over Kafka and REST; a sketch of one hop follows below]
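A sketch of one hop in this flow using kafka-python: a producer publishes a flume.filelanded event and a downstream event module consumes it. The topic names come from the slide; the payload shape, broker address and timeout are assumptions.

```python
# One hop of the event flow: Flume lands a file and publishes flume.filelanded to Kafka;
# a downstream event module consumes it (and would then emit hdfs.filelanded).
import json
from kafka import KafkaProducer, KafkaConsumer   # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("flume.filelanded",
              {"path": "/landing/orders/2015-10-01.csv", "bytes": 1048576})  # hypothetical payload
producer.flush()

consumer = KafkaConsumer("flume.filelanded",
                         bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")),
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for event in consumer:
    print("HDFS event module picked up", event.value)
```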
  23. Customer - Docker services
Container | Third party | Used for
Nginx | No | Front-end web server/proxy for all the other UIs
Vert.x | No | Application server
Azkaban | Yes | Workflow management for Hadoop, Teradata etc.
Tessera/Graphite | No | Aggregating and displaying application- and system-level metrics
LogStash | Yes | Aggregating and displaying application- and system-level logs
Consul | Yes | Distributed key-value store used for service discovery
Ambassadord | Yes | Makes it easier for Docker containers to access services hosted in other Docker containers
  24. Easily access the UDA - tap into the power of the platform without duplicating effort
[Diagram: your analytic app sits on a microservices framework that exposes Teradata, Aster and Hive data services plus auth services]
  25. UDA & the LDA - Extract, Load & Transform in the Layered Architecture
Levels 0-3 of the layered data architecture:
• Staging - 1:1 copy of the source systems (Extract, Load)
• Integration - integrated model at the lowest granularity
• Calculation - key performance indicators (Transform)
• Aggregation - business-unit-specific rollups
[Diagram also shows AppCenter and Listener applications on top: Business Health, Workload Analytics, Member Segment Engine, Category Sales, Daily Financials]
  26. Questions…
  27. THANK YOU
