Introduction to HDF 3.0

  1. Introduction to HDF 3.0. Timothy Spann, 2017. Future of Data – Princeton Meetup, June 20, 2017. Hosted by TRAC Intermodal. © Hortonworks Inc. 2011 – 2017. All Rights Reserved.
  2. Schema Registry (Milind Pandit); HDF Streaming Updates (Tim Spann); EDW Optimization with Hadoop and HDF (Gregory C. Keys, PhD)
  3. Ambari Integration
  4. Format and Schema Aware Efficient Flow Management
      - Provide processors for schema-aware record structures for common processing patterns
        - Split, Enrich, Partition, Convert, Query (SQL queries powered by Apache Calcite)
        - Put/Get records between NiFi and Kafka, Elasticsearch, RDBMS (more soon)
        - Easy bridging to/from columnar data formats like ORC or Parquet
      - Separate format/schema-specific logic into extensible record readers and writers
        - Developers can write new readers/writers
        - Users can create new readers/writers with scripting, live in production!
      - So what?
        - Format and schema aware processing *with* generic reusable components
        - Maintains full provenance/lineage trail
        - Dramatic speed/efficiency increase per node
        - Integration with Hortonworks Schema Registry and extensible for others
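The reader/writer separation described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not NiFi's API: `read_csv_records`, `write_json_records`, and `convert` are hypothetical names, and the sample data is made up.

```python
import csv
import io
import json

def read_csv_records(text):
    """Record reader: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def write_json_records(records):
    """Record writer: serialize records as a JSON array."""
    return json.dumps(records, sort_keys=True)

def convert(text, reader, writer):
    """Generic step that knows nothing about formats: any reader pairs with any writer."""
    return writer(reader(text))

# Hypothetical input; values stay strings because no schema is applied here.
csv_input = "driverId,speed\n11,72\n12,95\n"
json_output = convert(csv_input, read_csv_records, write_json_records)
print(json_output)
```

Because `convert` only sees records, supporting a new format means adding one reader or writer function and leaving the processing step unchanged.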
  5. Record Reader CS
  6. Record Writer CS
  7. ‘QueryRecord’ Processor – Treat streaming records as tables
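To make "treat streaming records as tables" concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for the Calcite-powered SQL engine. It is not how NiFi runs the query. The records and the speed threshold are invented for illustration, and the table is called `FLOWFILE` to mirror the name QueryRecord uses for the incoming data.

```python
import sqlite3

# Hypothetical batch of records, as a record reader might produce them.
records = [
    {"driverId": 11, "speed": 72},
    {"driverId": 12, "speed": 95},
    {"driverId": 13, "speed": 101},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FLOWFILE (driverId INTEGER, speed INTEGER)")
conn.executemany("INSERT INTO FLOWFILE VALUES (:driverId, :speed)", records)

# The "processor" is just a SQL statement over the batch of records.
speeding = conn.execute(
    "SELECT driverId, speed FROM FLOWFILE WHERE speed > 80 ORDER BY driverId"
).fetchall()
conn.close()
print(speeding)
```

The rows that match the query would be routed onward as a new set of records.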
  8. Component Versioning
  9. Stream Processing – Introducing Streaming Analytics Manager (SAM)
      - Streaming Analytics Manager: a brand-new product module in the HDF stack to design, develop, deploy, and manage streaming analytics apps with a drag-and-drop user experience
  10. SAM - Write Complex Streaming Applications With No Code
      - Streaming Analytics Manager: a brand-new product module in the HDF stack to design, develop, deploy, and manage streaming analytics apps with a drag-and-drop paradigm
        - Build streaming analytics applications that do event correlation, context enrichment, complex pattern matching, analytical aggregations, and creation of alerts/notifications when insights are discovered.
        - Give coders the power to add key functions and extend the platform (add custom sinks, processors, spouts, etc.)
  11. SAM’s Value Proposition
      - Build and deploy complex stream applications without writing any code
      - Only open source tool in the market with a graphical programming paradigm
      - Speed time-to-market to build complex streaming analytics applications
      - Build streaming analytics applications without specialized skillsets
      - Decouple data format from the streaming application itself while being schema aware
      - Support multiple underlying streaming engines
  12. Stream Builder Module for App Developers
      - Builder components, shown on the canvas palette, are the building blocks used by the app developer to build streaming applications
      - Drag and drop to build a working streaming application without writing a single line of code
      - 4 types of components: Sources, Processors, Sinks and Custom
  13. Stream Insight Module for Business Analysts
      - A tool to create real-time analytics dashboards, charts and graphs
      - 30+ visualization charts out of the box with customization capability
      - Druid is the analytics engine that powers the Stream Insight Module.
  14. Stream Ops Module for IT Operations
      - Create and manage different environments in which individual streaming applications will be built
      - Environments consist of services such as HDFS, Kafka, and Storm from different service pools
      - Save time and reduce operational overhead with the same drag-and-drop paradigm as the Stream Builder module
  15. Stream Builder Module for App Developers
      - Builder components, shown on the canvas palette, are the building blocks used by the app developer to build streaming apps.
      - Drag and drop to build a working streaming application without writing a single line of code.
      - 4 types of components: Sources, Processors, Sinks and Custom
  16. SAM is All about Doing Real-Time Analytics on the Stream
      - Real-Time Prescriptive Analytics: What should we do right now?
      - Real-Time Predictive Analytics: What could happen now/soon?
      - Real-Time Descriptive Analytics: What is happening right now?
  17. Real-Time Prescriptive Analytics
      - Question: What should we do right now?
      - Context: It is rainy, the driver has been on the road for 12 hours, and he has 30 high speeding alerts over a 3-minute window in the last 2 hours.
      - Answer: Dispatch a radio call to the driver to slow down
  18. Real-Time Predictive Analytics
      - Question: No violation events, but what might happen that I need to be worried about?
      - My data science team has a model that can predict that based on
        - Weather
        - Roads
        - Driver HR info like driver certification status, wagePlan
        - Driver timesheet info like hours and miles logged over the last week
  19. Building the Predictive Model on HDP
      1. Explore a small subset of events to identify predictive features and make a hypothesis, e.g. “foggy weather causes driver violations”.
      2. Identify suitable ML algorithms to train a model. We will use classification algorithms, as we have labeled events data.
      3. Transform enriched events data to a format that is friendly to Spark MLlib. Many ML libs expect training data in a certain format.
      4. Train a logistic classification Spark model on YARN, with the above events as training input, and iterate to fine-tune the generated model.
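As a rough picture of step 4, the sketch below trains a tiny logistic classifier with plain batch gradient descent. It is a pure-Python stand-in and does not use Spark MLlib or YARN. The feature layout (`is_foggy`, scaled hours driven), the labeled events, and the hyperparameters are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """Probability of a violation; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xj for w, xj in zip(weights[1:], x)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Hypothetical labeled events: [is_foggy, hours_driven / 10]; label 1 = violation.
events = [[1, 0.9], [1, 0.7], [0, 0.8], [0, 0.3], [1, 1.1], [0, 0.2]]
labels = [1, 1, 0, 0, 1, 0]
model = train_logistic(events, labels)

print(predict_proba(model, [1, 1.0]))  # foggy, long shift
print(predict_proba(model, [0, 0.2]))  # clear, short shift
```

Iterating on the learning rate, epoch count, and feature set corresponds to the "iterate to fine-tune" part of the step.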
  20. Logistic Regression Model
  21. Scoring the Predictive Model on HDF
      5. Model Registry: Export the Spark MLlib model and import it into HDF’s Model Registry.
      6. Enrich with Features: Use SAM’s enrich/custom processors to enrich the event with the features required for the model.
      7. Transform/Normalize: Use SAM’s projection/custom processors to transform/normalize the streaming event and the features required for the model.
      8. Score Model: Use SAM’s PMML processor to score the model for each stream event with its required features.
      9. Alert / Notify / Action: Use SAM’s rule and notification processors to alert, notify and take action using the results of the model.
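Steps 6 to 9 form a simple chain, which can be sketched as four small functions. None of this is SAM's API: the lookup tables, the model weights, the 100-hour cap, and the 0.5 threshold are assumptions made up for the example, and the hand-written logistic score stands in for the PMML processor.

```python
import math

# Hypothetical lookup data standing in for the enrichment sources.
WEATHER = {"route-7": {"foggy": 1}}
TIMESHEETS = {11: {"hours_last_week": 70}}

# Hypothetical model weights; in SAM these would come from a PMML model.
BIAS, W_FOGGY, W_HOURS = -3.0, 2.0, 2.5

def enrich(event):
    """Step 6: add the features the model needs."""
    return {**event, **WEATHER[event["route"]], **TIMESHEETS[event["driverId"]]}

def normalize(event):
    """Step 7: scale hours into the 0..1 range (assumed cap of 100 hours)."""
    return {**event, "hours_norm": min(event["hours_last_week"], 100) / 100}

def score(event):
    """Step 8: logistic score computed from the normalized features."""
    z = BIAS + W_FOGGY * event["foggy"] + W_HOURS * event["hours_norm"]
    return {**event, "violation_prob": 1 / (1 + math.exp(-z))}

def alert(event, threshold=0.5):
    """Step 9: return a notification when the score crosses the threshold."""
    if event["violation_prob"] > threshold:
        return f"ALERT driver {event['driverId']}"
    return None

scored = score(normalize(enrich({"driverId": 11, "route": "route-7"})))
result = alert(scored)
print(result)
```

Each function returns a new event dict, so the stages stay independent and can be reordered or replaced, much like processors on a canvas.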
  22. SAM’s Model Registry and PMML Processor
      - Model Registry
        - SAM has a repository to store and manage PMML-based predictive models
        - First-class features like versioning, evolution policies, etc. will be added in a future release
      - PMML Processor
        - Processor that can use a model from the registry and score it against the incoming stream of events
  23. SAM Extensibility: Custom Processors, UDFs, UDAFs
      - Custom Components
        - Most users will want to build custom components to meet certain requirements.
        - SAM provides the ability to build custom components using the SAM SDK
        - The jars can then be uploaded to SAM via the user interface
      - 3 Types of Custom Components
        - Custom Processors
        - Custom UDFs
          - User defined functions that are used by the Projection processor
        - Custom UDAFs
          - User defined aggregate functions that are used by the Aggregate processor.
        - SDK can be used to create custom UDF functions for windowed aggregations
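The idea of a user defined aggregate function applied over windows can be sketched as follows. The real SAM SDK is Java and its interfaces are not shown in these slides, so the `init`/`add`/`result` shape, the `AverageSpeed` class, and the `tumbling_window_aggregate` helper are all hypothetical, as is the sample data.

```python
from collections import defaultdict

class AverageSpeed:
    """Hypothetical aggregate function with init/add/result steps."""

    def init(self):
        return (0.0, 0)  # (running sum, count)

    def add(self, agg, value):
        total, count = agg
        return (total + value, count + 1)

    def result(self, agg):
        total, count = agg
        return total / count if count else None

def tumbling_window_aggregate(events, window_seconds, udaf):
    """Group (timestamp, value) events into fixed windows and aggregate each one."""
    state = defaultdict(udaf.init)
    for ts, value in events:
        window_start = ts - ts % window_seconds
        state[window_start] = udaf.add(state[window_start], value)
    return {start: udaf.result(agg) for start, agg in sorted(state.items())}

# Hypothetical (timestamp in seconds, speed) events.
events = [(0, 60), (30, 80), (65, 100), (119, 90), (120, 70)]
averages = tumbling_window_aggregate(events, 60, AverageSpeed())
print(averages)
```

Keeping the aggregate as a small accumulator means the windowing logic never needs to hold every event of a window in memory.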
  24. Streaming Split Join Pattern
      - 3 enrichments have to be performed on the event stream to feed into the model:
        - From lat, long and time, query weather conditions
        - From driverId, look up information about the driver’s certification and wage plan
        - From driverId, look up how many miles and hours the driver was on the road last week
      - Streaming Split Join Pattern
        - Complex pattern that allows parallel processing to decrease latency (used extensively by Apache Metron)
          1. Create a splitJoinKey
          2. Split the stream into n, where n is the number of different enrichments you want to do
          3. Join the n streams based on the splitJoinKey
      - A complex pattern to implement, which SAM allows the user to build simply, with no code!
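The three steps of the split-join pattern can be sketched with a thread pool. This is a single-process illustration, not how SAM or Metron implement it. The three lookup functions return canned, made-up values, and `split_join` is a hypothetical helper.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Hypothetical enrichment lookups; real ones would query external services.
def weather_lookup(event):
    return {"weather": "foggy"}

def hr_lookup(event):
    return {"certified": True, "wagePlan": "miles"}

def timesheet_lookup(event):
    return {"hours_last_week": 70, "miles_last_week": 2800}

ENRICHMENTS = [weather_lookup, hr_lookup, timesheet_lookup]

def split_join(event):
    # 1. Create a split-join key that tags every branch of this event.
    key = str(uuid.uuid4())
    keyed = {**event, "splitJoinKey": key}

    def run_branch(lookup):
        return {"splitJoinKey": key, **lookup(keyed)}

    # 2. Split: run the n enrichments in parallel.
    with ThreadPoolExecutor(max_workers=len(ENRICHMENTS)) as pool:
        branches = list(pool.map(run_branch, ENRICHMENTS))

    # 3. Join: merge the branches that carry the same key back into one event.
    joined = dict(keyed)
    for branch in branches:
        if branch["splitJoinKey"] == key:
            joined.update(branch)
    return joined

enriched = split_join({"driverId": 11, "lat": 40.35, "long": -74.66})
print(enriched)
```

Running the lookups side by side means the total latency is close to that of the slowest enrichment instead of the sum of all three.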
  25. Stream Insight Module for Business Analysts
      - A tool to create time-series and real-time analytics dashboards, charts and graphs
      - 30+ visualization charts out of the box with customization capability
      - Druid is the analytics engine that powers the Stream Insight Module.
  26. Streaming Analytics Manager
  27. Set Up An Environment for SAM
  28. Hortonworks SAM Canvas to build the Streaming Analytics App without writing a line of code
  29. Hortonworks SAM App Dashboard
  30. Schema Registry Dashboard and Details of One Schema
  31. Contact: Timothy Spann, @PaaSDeV, www.meetup.com/futureofdata-princeton, community.hortonworks.com/users/9304/tspann.html
  32. Hortonworks Community Connection
      - Read access for everyone; join to participate and be recognized
        - Full Q&A platform (like StackOverflow)
        - Knowledge Base articles
        - Code samples and repositories
  33. Community Engagement
      - Participate now at: community.hortonworks.com
      - 4,000+ Registered Users
      - 10,000+ Answers
      - 15,000+ Technical Assets
      - One Website!
