The document discusses Cloudera Fast Forward Labs and how it can help organizations accelerate their machine learning and data strategies. It provides research, advising, and application development services to help clients stay on top of emerging technologies, define optimal data strategies, and evaluate machine learning capabilities. Cloudera Fast Forward Labs aims to be organizations' partner for creating and executing excellent data strategies.
Welcome to Cloudera Sessions! My name is JJ Sakey and I have the honor of taking us through today’s jam-packed program. By way of introduction, {tell us about what you do JJ}.
We are all here today because we believe that data can make what is impossible today possible tomorrow. Many of us in this room have already created board-level impact for our businesses. Today you will hear from customers about their data journeys – Sentier, Amazon Web Services, and Altisource. There is no question that big data is improving insight into customers, connecting products and services via IoT, and protecting your business from cyberattacks and regulatory fines. But it is also quietly having a major social impact that affects every one of us in this room. For example, four of the five cancer research centers are using Cloudera to find a cure, and when we do, the odds are good it will have been done with our software and big data. Cloudera has partnered with many hospitals and has already saved hundreds of lives through early detection of sepsis. Lastly, a topic near to my heart: we have partnered with several non-profit groups to detect early signs of suicide, especially among veterans, for whom it is one of the leading causes of death.
So, thank you for spending the day with us talking about big data, which is having a profound effect on business and on our lives.
Let’s jump in and cover a couple important logistics items first…
What is artificial intelligence? What is machine learning? Popular media suggests that AI is all about recognizing pictures of cats and dogs, or machines beating humans at the game of Go. But when we peel away the hype, the machine learning of today is really a very smart pattern recognizer. How do we leverage this capability and turn it into a competitive advantage? The tech stack looks like this. First, there is data. Then we need the capability to build a basic understanding of that data – this is the analytics layer. At this layer, we are able to say things like, “the average age of my customer is 40.” Naturally, we would like our data to tell us more – and this is when we move into the data science layer. Both analytics and data science are built on top of the big data layer, but the data science layer’s data requirements are stricter than the analytics layer’s. Here we focus more on data cleaning and prepping, and we are able to answer questions like, “How much in sales do I expect to generate next year?” To answer more sophisticated questions, we move into the ML layer. The ML layer puts a lot of focus on algorithms and can only work if the organization has mastered the lower layers of the stack.
Analytics – descriptive stats, visualize data
Data Science – data cleaning, prep, analyze (forecast)
ML - algorithms
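The three layers above can be made concrete on the same toy dataset. This is a minimal sketch with made-up customer ages and sales figures (all numbers are hypothetical, and the nearest-neighbour rule is just a stand-in for a real ML algorithm):

```python
import numpy as np

# Hypothetical data: customer ages and yearly sales (in $k).
ages = np.array([25, 38, 40, 47, 50])
yearly_sales = np.array([100.0, 110.0, 125.0, 131.0, 140.0])

# Analytics layer: descriptive statistics ("the average age of my customer is 40").
avg_age = ages.mean()
print(f"average customer age: {avg_age:.0f}")

# Data science layer: a simple forecast ("how much in sales next year?").
years = np.arange(len(yearly_sales))
slope, intercept = np.polyfit(years, yearly_sales, deg=1)
next_year_forecast = slope * len(yearly_sales) + intercept
print(f"forecast for next year: {next_year_forecast:.1f}")

# ML layer: an algorithm that learns a pattern from examples, here a
# toy nearest-neighbour rule labelling a customer by the closest known case.
def nearest_label(age, known_ages, labels):
    return labels[int(np.argmin(np.abs(known_ages - age)))]

labels = np.array(["low", "low", "high", "high", "high"])
print(nearest_label(44, ages, labels))
```

Each layer asks more of the data than the one below it: a mean needs only the raw numbers, the forecast needs clean, ordered history, and the learned rule needs labelled examples.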
Machine learning will transform businesses. It is a huge opportunity for every company, but it is hard to execute on. Companies often do not know what questions to ask, and what problems to focus on. Even if they have converged on a problem, they soon realize that …
… and that no software can solve their problem.
Successful data products are often a clever combination of known components, machine learning tools and algorithms, applied to a well understood problem.
Building the right data product requires both strategy and technology to be properly aligned. You cannot build a data product independently because business strategy dictates data availability. In many cases data opportunities require optimizing over both business needs and technological capability and can also require organizational transformation.
And this is where we come in –
Combined with Cloudera’s Enterprise Data Hub and Data Science Workbench, our goal is to accelerate machine learning in the enterprise, from research to production.
We sit at the intersection of three entities.
What we try to do is build a bridge connecting academic research and the enterprise – in a way, we extract and present information from academic research so that businesses can make use of it. We also intersect with startups, because they are a helpful window into what businesses are looking for.
Our team lives at, and has experience across, the intersection of startup culture (agility, novelty, speed), academic research (where new algorithmic ideas come from), and the enterprise (the opportunity to execute at scale on unique data). We’ve been doing this for 5 years (cf. Amazon, MSFT, Google, etc.)
Academic research doesn’t focus on valuable business problems.
Startups generally don’t invent new technology.
Corporate R&D struggles to align with business priorities and effectively execute.
What makes us unique? We engage in 3 ways – research subscription, advising and application development.
We use research to help clients stay on top of emerging ML technologies. Every quarter we release a research report focusing on a new capability or breakthrough that we believe will become important in the next six months to two years.
The second way we engage is through advising, where we help define data strategy and evaluate ML capabilities. Our research subscription comes with four hours of advising per month. This time is tailored for each client, and every client uses it differently. As an example, clients have used the time to take a deeper dive into our reports, to identify data assets, to guide their ML product development, and to develop strategic and technical roadmaps.
Lastly, for clients who have very specific projects in mind but are unsure whether they have the resources to succeed, we help transform these science experiments into actual products by performing feasibility studies. The deliverables are proof-of-concept code and extensive documentation of what worked and what didn’t. In the end, clients get a piece of working code, tailored to their problem and data, that they own and can build on top of.
Here are all the reports we have done in the past.
In choosing a breakthrough topic, we use the answers to three questions as a guide: 1) Is it useful? 2) Can we build a prototype? 3) Is it timely? Purely algorithmic breakthroughs are not interesting to the business community unless they have specific applications; one way to ensure usefulness is to filter for breakthroughs on which a product prototype can be built. Finally, the breakthrough has to be timely: it has to be more possible now than it was one to two years ago, and we expect it to be even more possible one to two years from now. We predict timeliness using two gauges: i) economic constraints and ii) commoditization of tools. A sudden lifting of economic constraints can make previously nice ideas practical, while commoditization of tooling makes it quicker to build things that were possible but difficult and time-consuming to get right. Deep learning clearly illustrates both aspects: GPUs lifted its economic constraints, and Keras/TensorFlow commoditized its tooling.
In our latest report, on semantic recommendations, we look at the state of recommendation systems and their common pitfalls. Recommendation systems have been around for many years, and businesses rely on them to surface interesting items for end users. Unfortunately, classical recommendation systems do not understand what they are recommending: things are recommended to you because others similar to you have liked them. In our report, we look at ways to inject the content of items into the system. When we do this, we are building a recommendation system that understands user preferences as they relate to item content. It turns out this technique also solves the cold-start problem – a common problem in classical recommendation systems, where the system does not know how to generate recommendations for new items.
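To make the cold-start point concrete, here is a minimal content-based sketch (not the report’s prototype; the catalog and item descriptions are hypothetical). Because items are compared by their text content rather than by interaction history, a brand-new item with zero ratings can still be matched against the catalog:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog of items with text descriptions.
catalog = {
    "intro_statistics": "an introduction to probability and statistics",
    "deep_learning":    "neural networks and deep learning for images",
    "cooking_basics":   "simple recipes and kitchen techniques",
}
# A brand-new item with no ratings at all (the cold-start case).
new_item = ("bayesian_methods", "probability, bayesian statistics and inference")

titles = list(catalog) + [new_item[0]]
texts = list(catalog.values()) + [new_item[1]]

# Represent every item by its content, then compare the new item to the rest.
vectors = TfidfVectorizer().fit_transform(texts)
sims = cosine_similarity(vectors[-1], vectors[:-1])[0]
best = titles[int(sims.argmax())]
print(f"most similar existing item to {new_item[0]}: {best}")
```

A classical collaborative filter would have nothing to say about the new item until users start rating it; the content representation gives it a sensible neighborhood immediately.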
In the interpretability report, we look at ways to understand and explain how a model makes decisions. Interpretability is important not just for regulatory reasons: being able to explain why and how a model works can help us improve models and build better products. Black-box techniques like deep learning deliver breakthrough capabilities at the cost of interpretability – in this report, we show how to make models interpretable without sacrificing their capability or accuracy.
If your model is accurate but you have no idea how it works, what are you missing? It turns out quite a lot! It is easier to improve an interpretable model. The ability to explain individual decisions to their subjects is intrinsically useful; people like to know why a model has treated them a certain way. And in many cases there is an ethical and/or legal duty to ensure models are safe and non-discriminatory, which can only be done if they are interpretable. A paper published in 2016 made this report possible by releasing an algorithm called LIME that probes the inner workings of a black-box model.
Text summarization. This report looks at a specific and very practical problem: summarizing documents. We show how to do that using the latest and greatest ideas from deep learning and topic modeling. But because text summarization is just a special case of a much broader set of problems — how can we help computers work with natural language — it’s a report with much wider implications, for any of us who work with text, either consuming or generating it.
Next, probabilistic programming. The conclusions you draw from imperfect or incomplete data are uncertain, and this report is all about how you work with that uncertainty. Academic statisticians have known how to deal with it for a long time, but it is only in the past few years that the algorithms have caught up with the scale of big data, and only very recently that tools have made these algorithms accessible.
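A toy illustration of reasoning under uncertainty, with made-up numbers: 7 conversions out of 40 trials. Full probabilistic programming systems generalize this idea to arbitrary models; here a conjugate Beta-Binomial model lets us sample the posterior over the true conversion rate directly with NumPy:

```python
import numpy as np

# Hypothetical observed data: 7 conversions out of 40 trials.
conversions, trials = 7, 40
rng = np.random.default_rng(0)

# With a uniform Beta(1, 1) prior, the posterior over the true rate is
# Beta(1 + conversions, 1 + failures); sample from it directly.
posterior = rng.beta(1 + conversions, 1 + trials - conversions, size=100_000)

low, high = np.percentile(posterior, [2.5, 97.5])
print(f"posterior mean rate: {posterior.mean():.3f}")
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```

Instead of a single point estimate ("the rate is 17.5%"), you get a whole distribution, so downstream decisions can account for how wide that credible interval still is.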
In our deep learning report, we look at how neural networks enable us to analyze images. We explain what neural networks are, and how we can apply deep learning today.
In all our reports, we begin with a gentle introduction to the capability. We then move on to a rigorous but conceptual discussion of the state-of-the-art algorithms. We also describe the prototype, and the process of building it.
For clients who are interested in implementing the new capability, we dedicate a chapter to the commercial and open-source landscape that will hopefully help with the build-or-buy decision.
Because the focus is on business applications, each report also has a chapter on ethics.
We close with a sci-fi short story – mostly to get readers to imagine, in a very unconstrained way, what the capability could do for their businesses.
With all that in mind, let’s take a closer look at a couple of the reports, starting with text summarization.
How do you take a long document and make it shorter?
More generally, how do you make language computable?
We describe single and multiple document summarization using:
topic models (a mature, accessible approach)
language embeddings and recurrent neural networks (a cutting-edge deep learning approach)
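To make the extractive idea concrete, here is a deliberately simple frequency-based summarizer. It is far cruder than the topic-model and neural approaches the report covers (and the stopword list and example document are my own), but it captures the same "select the most representative sentences" pattern:

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "are", "in", "it"}

def summarize(text: str, n_sentences: int = 1) -> str:
    """Score each sentence by average word frequency and keep the top ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> float:
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOPWORDS]
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Re-emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

doc = ("Machine learning systems learn patterns from data. "
       "Summarization systems compress long documents. "
       "Learning from data lets systems improve with experience.")
print(summarize(doc))
```

The deep learning approaches in the report replace the crude frequency score with learned representations of meaning, but the shape of the task is the same: rank pieces of the document, keep the most representative ones.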
Next, let’s look at our interpretability report.
Interpretable models are easier to improve
Regulators and society can better trust them to be safe and nondiscriminatory
They offer insights that can be used to change real-world outcomes for the better
We describe the Local Interpretable Model-Agnostic Explanation (LIME) algorithm
To illustrate the capability, we built a prototype where we model the likelihood of a customer churning. Without interpretability, all the model gives us is the probability that a customer will churn. As an example, we see here that customer ID 3676 has a 79% chance of churning.
When we add interpretability to the model using LIME, we can now see why a customer is assigned a particular churn probability. The factors are color-coded – the redder a factor, the higher the importance LIME has assigned to it.
Using LIME, we are able to say that the 79% churn probability is mostly driven by three factors: the customer has fiber, their contract is month-to-month, and they are a new customer.
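The LIME idea behind this kind of explanation can be sketched from scratch: perturb the customer’s features, query the black-box model, and fit a locally weighted linear model whose coefficients act as importances. This is a simplified sketch, not the report’s prototype; the feature names, synthetic data, and churn rule are all hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
features = ["has_fiber", "month_to_month", "tenure_years"]

# Synthetic training data in which fiber + short contracts drive churn.
X = np.column_stack([
    rng.integers(0, 2, 500),      # has_fiber
    rng.integers(0, 2, 500),      # month_to_month
    rng.uniform(0, 10, 500),      # tenure_years
])
y = ((X[:, 0] + X[:, 1] - 0.2 * X[:, 2] + rng.normal(0, 0.3, 500)) > 1).astype(int)

# The black-box model whose decisions we want to explain.
black_box = RandomForestClassifier(random_state=0).fit(X, y)

customer = np.array([1.0, 1.0, 0.5])  # has fiber, month-to-month, new customer
print("churn probability:", black_box.predict_proba([customer])[0, 1])

# LIME-style local explanation: sample a neighborhood around the customer,
# weight samples by proximity, and fit an interpretable linear surrogate.
perturbed = customer + rng.normal(0, 0.5, size=(1000, 3))
preds = black_box.predict_proba(perturbed)[:, 1]
weights = np.exp(-np.linalg.norm(perturbed - customer, axis=1) ** 2)
local = Ridge().fit(perturbed, preds, sample_weight=weights)
for name, coef in zip(features, local.coef_):
    print(f"{name}: {coef:+.3f}")
```

The surrogate’s coefficients play the role of the color-coded factors on the slide: a large positive coefficient means the factor is pushing this particular customer’s churn probability up.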
Cloudera helps scale data science and ML:
Cloudera accelerates machine learning in the enterprise, from research to production.
We address uncertainty with Fast Forward Labs research and advising that cut through the hype
We address data-silo issues with our Enterprise Data Hub, which unifies collection, access, and deployment with shared security and governance
Lastly, our Data Science Workbench makes collaborative, secure data science at scale a reality for the enterprise.
SDX: shared data services (ALTUS)
Cloudera Altus lets you automate massive-scale data engineering and analytic database compute workloads in your public cloud, without the headache of managing the infrastructure yourself. At the core of Altus is Cloudera's Shared Data Experience (SDX) that eliminates data silos with persistent metadata, security, and governance.