Data-in-Motion Unleashed
Andrew Psaltis
IoT Solutions Architect
Hortonworks
Hortonworks DataFlow (HDF)
→ The HDF Data-in-Motion Journey at Summit
→ The End-to-End Data-in-Motion Challenge
→ Introducing Streaming Analytics Manager (SAM)
HDF
HORTONWORKS
DATAFLOW
The HDF Journey
STREAM
PROCESSING
HDF
Real-time dashboard on top of HDP
Hadoop Summit
Enhanced with enrichment and predictive analytics
Hadoop Summit
The HDF Journey
STREAM
PROCESSING
HDF
The HDF Journey
HDF
FLOW MANAGEMENT
All about Data Collection
Hadoop Summit
The HDF Journey
STREAM
PROCESSING
HDF
ENTERPRISE
SERVICES
FLOW MANAGEMENT
2017
HDF 3.0
End-to-end solution for data in motion
DataWorks Summit
Data in Motion Across the Enterprise
DATA LIFECYCLE BEGINS
DATA VISUALIZATIONS & OPERATIONS CONTROL
ROUTING, STREAM PROCESSING & ANALYTICS
Across the Enterprise with HDF
COLLECT DATA AT THE EDGE
ANALYZE, VISUALIZE, PREDICT & PRESCRIBE
ROUTE, PROCESS, DELIVER & SYNDICATE
HDF
SECURITY
DATA PROVENANCE
COMMAND & CONTROL
FLOW MANAGEMENT
STREAM PROCESSING & ANALYTICS
Collect Data at the Edge
Collect Data at the Edge
SENT SECURED / CONTROLLED / TRACKED
RECEIVED FROM VEHICLE SENSORS
ACQUIRE
PRIORITIZE
ANALYZE/ROUTE
TRANSMIT
HDF
Route, Process, Deliver and Syndicate
Apache NiFi
Apache Kafka
HDF
Route, Process, Deliver and Syndicate
Apache NiFi
Apache Kafka
Analyze, Visualize, Predict and Prescribe
HDF
Analyze, Visualize, Predict and Prescribe
HDF
End-to-End Vision
→ Intelligence from edge to core
→ Closed-loop processing
→ Fully governed and secure
→ Centralized management/policy with distributed execution
→ Multi-tenant, multi-organization infrastructure
HDF: Data-in-Motion Platform
→ Scalable broker for streaming apps
→ Scale-out computational engine
→ Complex event processing
→ Pattern matching
→ Prescriptive/predictive analytics
FLOW MANAGEMENT
STREAM PROCESSING / ANALYTICS
ENTERPRISE SERVICES
▪ Acquisition and delivery
▪ Transformation, filtering, and routing
▪ Simple event processing
▪ Complete data provenance
▪ Bi-directional communication
▪ Provisioning and management
▪ Monitoring and security
▪ Auditing and compliance
▪ Governance and multi-tenancy
Find more #DWS17 sessions and slides at: www.DataWorksSummit.com
THANK YOU

Data-In-Motion Unleashed

Editor's Notes

  • #2 TALKING POINTS: Welcome to DataWorks Summit – why the name change? Thank the community for their support. Market and community growth perspective. Approaching 6 years of Hortonworks. Scott intro.
  • #3 I'm going to spend a little bit of time this morning walking you through the journey we've taken over the last four years with the data-in-motion concept. I'll then spend a few minutes walking you through the details of what we mean when we say data in motion. Many of you will hear quite a few buzzwords over the next few days, and I want to make sure they land on a technical level. Then we're going to spend the bulk of the time with me showing you in detail something we're very excited about: Streaming Analytics Manager.
  • #4 In 2014 we came out and gave a keynote where we described how you can use HDP to do real-time visualization of data. At that time a big shift was happening: people understood what they could do in HDP, specifically analyzing data at scales they never could before, but they were starting to have much more interest in lower-latency analysis. We presented how you can do this type of processing and analysis; however, we totally hand-waved on how the data gets there in the first place.
  • #5 Then in 2015 we came out and described how to enhance the data as it lands in the cluster: as it's arriving, enrich it and make it more useful for analysis and visualization. Again, still very much hand-waving on how the data gets there in the first place.
  • #6 In 2016 we wanted to expand our view and help customers and companies get much better at how they collect data, drive it throughout the enterprise, and deliver it to the cluster. This time, though, we hand-waved about how you do stream processing on the data.
  • #7 This year we are very excited that we could come out and tell a truly balanced end-to-end story and show solutions for how you can collect, process, visualize, and understand data in motion all the way through. That's what HDF 3.0 is all about.
  • #8 Data in motion across the enterprise starts at the edge. For some of you the edge may be planes, trains, and automobiles; for others it's traditional enterprise assets: servers, workstations, laptops, network devices, and so on. For us it's wherever the data lifecycle begins, right at the first moment the data is created. From the very first observation we want to help you command and manage the data all the way through to the next hop, whether that's a regional gateway, a core data center, or the cloud: full end-to-end processing of the data along its journey. Then we want to enable stream processing as the data arrives and provide powerful visualizations that let you interact with the data while it's in motion, all in time for you to extract the maximum value from it. As we all know, in many cases the value of data is perishable and decays over time, so we want to help you operate on it in time.
  • #9 As you think about this problem end-to-end, there are a few cross-cutting concerns that have to be available all the way through the chain. First and foremost is security: as we move to the edge we step out of the cozy confines of the enterprise, where we have LDAP, Active Directory, Kerberos, and other technologies that often aren't available to us outside the enterprise walls. We need to think about end-to-end security, and we have to be able to shape-shift so we can use the best and most effective techniques all the way through. We also want to make sure you understand the origin and attribution of every piece of data: everywhere it comes from and everywhere it goes. Total lineage, not just once it lands in the cluster, but from the very moment it's created, across all the systems that use it. Understanding latencies is also a very important part of that story, and a big piece of the governance story. Finally, a critical part of our story is command and control. Building these things at scale is very difficult, specifically because once you get it set up you, as an organization, want to make changes to it. It's not static (at least if you're doing it right it's not static); you need to be agile. Having a really powerful command-and-control mechanism to enable that agility is a critical part of this story.
  • #10 Let me set the stage for the demo I'm going to show you. Imagine we are a transportation company with a fleet of trucks driving around, and we want to gather data from sensors on the vehicles. To do that we are going to use MiNiFi, a subproject of Apache NiFi that we make available through HDF.
  • #11 With MiNiFi we'll acquire the data from a variety of sensors and do the initial analysis so we can understand the relative value of the data and prioritize it. Then we will use the most effective and most appropriate communication mechanism available to us: maybe that's activating an LTE signal to send out really critical, time-sensitive data, or maybe, for data with less time sensitivity, we buffer it and send it out when we have a WiFi signal.
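    The acquire → prioritize → analyze/route → transmit loop described above can be sketched in plain Python. This is an illustrative model, not MiNiFi's actual API: the `Reading`/`EdgeAgent` names, the `CRITICAL_SENSORS` set, and the LTE/Wi-Fi routing rule are all assumptions made for the sketch.

    ```python
    from dataclasses import dataclass, field
    from typing import List

    # Assumption: sensors whose readings are time-critical and should
    # be transmitted immediately rather than buffered.
    CRITICAL_SENSORS = {"brake_temp", "engine_oil_pressure"}

    @dataclass
    class Reading:
        sensor: str
        value: float

    @dataclass
    class EdgeAgent:
        """Toy model of the acquire / prioritize / route / transmit loop."""
        buffer: List[Reading] = field(default_factory=list)
        sent_lte: List[Reading] = field(default_factory=list)

        def ingest(self, reading: Reading) -> str:
            # Prioritize: time-critical sensors go out immediately over LTE;
            # everything else is buffered until a Wi-Fi window opens.
            if reading.sensor in CRITICAL_SENSORS:
                self.sent_lte.append(reading)
                return "lte"
            self.buffer.append(reading)
            return "buffered"

        def flush_on_wifi(self) -> int:
            # Transmit buffered, low-priority data once Wi-Fi is available;
            # returns how many readings went out in this batch.
            sent = len(self.buffer)
            self.buffer.clear()
            return sent
    ```

    In a real deployment this decision logic lives in the MiNiFi flow configuration rather than in application code; the sketch only makes the prioritization idea concrete.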
  • #12 The data then arrives at a more regional or core location, and here we use technologies like Apache NiFi and Apache Kafka.
  • #13 Perhaps we want to normalize the events, add customer reference data to them, tokenize them, or further enrich them. We can then syndicate the events for a wide range of use cases as well as drive them into systems like Hadoop and Spark.
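    The normalize / enrich / tokenize step described above can be sketched as a per-event transform. The field names, the mph-to-km/h normalization, and the `customers` lookup table are hypothetical, and the hash-based token stands in for a real tokenization service; in NiFi this work would be done by processors in the flow, not hand-written code.

    ```python
    def enrich_event(event: dict, customers: dict) -> dict:
        """Normalize a raw vehicle event and join in customer reference
        data before syndicating it downstream (e.g. to a Kafka topic)."""
        out = {
            # Normalize: consistent key names and metric units.
            "vehicle_id": str(event["vehicleId"]),
            "speed_kmh": round(event["speed_mph"] * 1.60934, 1),
            "ts": event["timestamp"],
        }
        # Enrich: attach customer reference data keyed by vehicle.
        cust = customers.get(out["vehicle_id"], {})
        out["customer_name"] = cust.get("name", "unknown")
        # Tokenize: mask the raw VIN instead of passing it through
        # (a stand-in for a real, reversible tokenization service).
        out["vin_token"] = "VIN-" + str(hash(event["vin"]) % 10_000).zfill(4)
        return out
    ```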
  • #14 Now I'm very excited to be able to tell you what we can do with that data while it's still moving, while it's still very fresh, using the Streaming Analytics Manager.
  • #15 With SAM we can do various things, such as window-based processing, temporal and spatial correlation, model evaluation, aggregation, and enrichment, all with a schema applied to the data. Having a schema is critical when building a streaming application; it allows you to see and understand how your data evolves. With SAM we're really targeting app developers, business analysts, and operations teams, so that we can help them maximize their individual user experience as appropriate for their skill set, and also give them a way to easily collaborate around a unified platform and do a better job than they ever have before. Let me now introduce you to SAM.
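    SAM builds these window-based aggregations visually, but the underlying operation is a tumbling-window aggregate over timestamped events. A minimal sketch, assuming millisecond timestamps and an illustrative one-minute window (the event shape and field names are not SAM's):

    ```python
    from collections import defaultdict
    from statistics import mean

    def tumbling_window_avg(events, window_ms=60_000):
        """Group (ts, sensor, value) events into fixed, non-overlapping
        windows and compute the mean value per (window, sensor)."""
        buckets = defaultdict(list)
        for ts, sensor, value in events:
            # Align each event to the start of its tumbling window.
            window_start = (ts // window_ms) * window_ms
            buckets[(window_start, sensor)].append(value)
        return {key: mean(vals) for key, vals in buckets.items()}
    ```

    The same shape extends naturally to sliding windows (an event lands in several overlapping buckets) or to other aggregates such as min/max or count.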
  • #16 We've articulated an end-to-end GA vision for flow management, so you can see where we're heading: closed-loop processing; full governance; support for multiple processing environments; incorporating machine learning and AI to optimize what is collected, minimize time to insight, and more. The path is truly exciting, and we look forward to working with all of you on this journey.
  • #17 One of the things I really love about all of this is not only the ease with which developers can build streaming applications, but the power of allowing all the different actors to collaborate. I think you would agree we have a pretty important story and vision here about how we're going to continue to bring these things together and further enhance the experience. In closing, I want you to think about the fact that your enterprise is not composed of various big, cool clusters of distributed systems; it is one big logical distributed computing system, and we want to help you drive that and maximize the value of your data all the way through. Again, we do that by providing solutions all the way from the edge to the core, with flow management, stream processing, and enterprise services, enabling you to bring all of it together semantically through the registries and through a development and deployment experience that helps you harness the power of your data. Thank you all very much for your time.