Welcome to DataWorks Summit. I want to talk about why we changed the name, thank the community for their support, and offer a perspective on market and community growth as we approach six years of Hortonworks.
I'm going to spend a little bit of time this morning walking you through the journey that we've taken over the last four years with the data in motion concept.
I’ll spend a few minutes after that walking you through the details of what we mean when we say data in motion. Many of you will hear quite a few buzzwords over the next few days and I want to make sure they land on a technical level.
And then we're going to spend the bulk of the time with me showing you in detail something we're very excited about - Streaming Analytics Manager.
In 2014 we came out and gave a keynote where we described how you can use HDP to do real time visualization of data.
At that time there was a big shift happening where people understood what they could do in HDP, specifically to analyze data at scales they never could before. But they were starting to have much more interest in lower latency analysis.
We presented how you can do this type of processing and analysis, however we totally hand waved on how the data gets there in the first place.
Then in 2015 we came out and described how to enhance the data as it lands into the cluster. As it's arriving enrich it and make it more useful for analysis and visualization. Again still very much hand waving on how the data gets there in the first place.
In 2016 we wanted to expand our view and we wanted to help customers and companies get much better at how they collect the data, drive it throughout the enterprise and deliver it to the cluster. This time though we hand waved about how you do stream processing on the data.
This year we are very excited that we could come out and tell a truly balanced end to end story and show solutions for how you can collect, process, visualize and understand data in motion all the way through and that's what HDF 3.0 is all about.
Data in motion across the enterprise starts at the edge. For some of you the edge may be planes, trains and automobiles; for others it's traditional enterprise assets: servers, workstations, laptops, network devices and so on.
For us it's wherever the data life cycle begins, right at the first moment that it's created. From the very first observation we want to help you command and manage the data all the way through to the next hop, whether that's a regional gateway, a core data center or the cloud. Full end-to-end processing of the data along its journey.
Then to be able to do stream processing as it arrives and provide powerful visualizations allowing you to interact with the data all while it's in motion.
All of this in time for you to extract the maximum value from the data. As we all know, in many cases the value of data is perishable and diminishes over time, so we want to help you act on it in time.
As you think about this problem from end-to-end there are a few cross-cutting concerns that we have to make available all the way through the chain.
First and foremost is security, as we move to the edge we step out of the cozy confines of the enterprise where we have LDAP, Active Directory, Kerberos and other technologies which often aren't available to us outside of the walls of the enterprise.
We need to think about end to end security and we have to be able to shape shift so that we can use the best and most effective techniques all the way through.
We also want to make sure that you understand the origin and attribution of every piece of data. Everywhere that it comes from and everywhere that it goes. Total lineage. Not just once it lands in the cluster but from the very moment it's created to all the systems that use it. Understanding latencies is also a very important part of that story and a big piece of the governance story.
Finally, a very critical part of our story is command and control. Building these things at scale is very difficult. Specifically it's difficult because once you get it set up you, as an organization, want to make changes to it. It's not static, at least if you're doing it right it's not static, you need to be agile. Having a really powerful command and control mechanism to allow that agility is a critical part of this story.
Let me set the stage for the demo that I’m going to show you.
Let’s imagine we are a transportation company that has a fleet of trucks driving around and we want to gather data from sensors on the vehicles.
To do that we are going to use MiNiFi, which is a sub-project of Apache NiFi that we make available through HDF.
With MiNiFi we'll acquire the data from a variety of sensors, do the initial analysis so we can understand the relative value of the data and prioritize it. Then we will use the most effective and most appropriate communication mechanism available to us. Maybe that's activating an LTE signal to send out really critical time sensitive data or maybe for data that has less time sensitivity we buffer it and send it out when we have a WiFi signal.
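The edge-side logic just described can be sketched in a few lines. This is a hypothetical illustration, not MiNiFi's actual API: a toy agent scores each sensor reading, sends time-critical data immediately (standing in for activating an LTE link) and buffers the rest until a cheaper link such as WiFi is available. The `Reading` shape, the priority rule and the threshold are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Reading:
    sensor: str
    value: float

@dataclass
class EdgeAgent:
    critical_threshold: float = 0.9
    buffer: List[Reading] = field(default_factory=list)
    sent_urgent: List[Reading] = field(default_factory=list)

    def priority(self, r: Reading) -> float:
        # Toy scoring rule: engine-temperature readings above 100
        # are treated as time-sensitive; everything else is routine.
        if r.sensor == "engine_temp" and r.value > 100:
            return 1.0
        return 0.1

    def ingest(self, r: Reading) -> None:
        if self.priority(r) >= self.critical_threshold:
            self.sent_urgent.append(r)   # stand-in for an immediate LTE send
        else:
            self.buffer.append(r)        # held until WiFi is in range

    def flush_on_wifi(self) -> List[Reading]:
        # Drain the buffered, less time-sensitive readings in one batch.
        batch, self.buffer = self.buffer, []
        return batch
```

The point of the sketch is the split itself: prioritization happens at the point of collection, so expensive links carry only the data that cannot wait.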
The data then arrives into a more regional or core location, and here we use technologies like Apache NiFi and Apache Kafka.
Perhaps we want to normalize the events, add customer reference data to them, tokenize them, or further enrich them.
We can then syndicate the events for a wide range of use cases as well as drive them into systems like Hadoop and Spark.
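The per-event enrich-and-tokenize step described above can be sketched as a plain function. Everything here is illustrative: the event fields, the in-memory `CUSTOMER_REF` lookup table and the `driver_license` field are invented for the example, and in practice the reference data would live in a proper store rather than a dict.

```python
import hashlib

# Hypothetical customer reference table, keyed by customer id.
CUSTOMER_REF = {"C-42": {"region": "EMEA", "tier": "gold"}}

def tokenize(value: str) -> str:
    # Replace a sensitive field with a stable, irreversible token
    # so downstream consumers never see the raw value.
    return hashlib.sha256(value.encode()).hexdigest()[:16]

def enrich(event: dict) -> dict:
    # Copy the event, join in reference data, and tokenize PII.
    out = dict(event)
    out.update(CUSTOMER_REF.get(event.get("customer_id"), {}))
    if "driver_license" in out:
        out["driver_license"] = tokenize(out["driver_license"])
    return out
```

Doing this enrichment as events flow through the regional tier means every downstream consumer, whether a stream processor or a Hadoop table, sees the same normalized, privacy-safe record.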
Now I'm very excited to be able to tell you about what we can do with that data while it's still moving, while it's still very fresh, using the Streaming Analytics Manager.
With SAM we can do various things such as perform window-based processing, temporal and spatial correlation, model evaluation, aggregation and enrichment, all with a schema applied to the data.
Having a schema is critical when building a streaming application; it allows you to see and understand how your data evolves.
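To make the combination of schema and window-based processing concrete, here is a minimal sketch, not SAM's actual API: events are validated against a declared schema, then grouped into fixed, non-overlapping (tumbling) time windows per truck to compute an average speed. The field names, types and window size are assumptions made up for the example.

```python
from collections import defaultdict

# Declared schema: field name -> expected Python type.
SCHEMA = {"truck_id": str, "speed": float, "ts": int}

def validate(event: dict) -> bool:
    # A shared schema lets every step of the stream agree on
    # field names and types; malformed events are dropped.
    return all(isinstance(event.get(k), t) for k, t in SCHEMA.items())

def tumbling_avg_speed(events, window_secs=60):
    # Assign each valid event to a fixed window by integer-dividing
    # its timestamp, then average speed per (truck, window) pair.
    sums = defaultdict(lambda: [0.0, 0])
    for e in events:
        if not validate(e):
            continue
        key = (e["truck_id"], e["ts"] // window_secs)
        acc = sums[key]
        acc[0] += e["speed"]
        acc[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}
```

In a real streaming engine the windows would close and emit incrementally as time advances; the batch version above only shows the grouping and aggregation logic itself.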
With SAM we're really targeting app developers, business analysts and operations teams so that we can help them maximize their individual user experience as necessary for their skill set and also give them a way to easily collaborate around a unified platform and do a better job than they ever have before.
Let me now introduce you to SAM.
We’ve articulated an end-to-end GA vision for flow management, and you can see where we’re heading: closed-loop processing, full governance, support for multiple processing environments, and incorporating machine learning and AI to optimize what is collected, minimize time to insight, and more.
The path ahead is truly exciting and we look forward to working with all of you on this journey.
One of the things that I really love about all of this is not only the ease with which developers can build streaming applications, but the power of allowing all the different actors to collaborate. I think you would agree we have a pretty important story and vision here about how we're going to continue to bring these things together and further enhance the experience.
In closing, I want you to think about the fact that your enterprise is not composed of various separate clusters of distributed systems; it is one big logical distributed computing system, and we want to help you drive that and help you maximize the value of data all the way through.
Again we do that through providing solutions all the way from the edge to the core with flow management, stream processing and enterprise services. Enabling you to bring all of it together semantically through the registries and through a development and deployment experience that helps you harness the power of your data.
Thank you all very much for your time.
[Slide content]
Speaker: IoT Solutions Architect
Hortonworks DataFlow (HDF)
Agenda: The HDF Data in Motion Journey at Summit; The End-to-End Data in Motion; Introducing Streaming Analytics Manager (SAM)
Section slides: The HDF Journey (on top of HDP; enhanced with enrichment and predictive analytics; all about data collection; for data in motion); Data in Motion Across the Enterprise with HDF; At the Edge