Big Data is an increasingly powerful enterprise asset, and this talk will explore the relationship between big data and cyber security: how we preserve privacy whilst exploiting the advantages of data collection and processing. Big Data technologies give both governments and corporations powerful tools to offer more efficient and personalized services. The rapid adoption of these technologies has of course created tremendous social benefits. Unfortunately, an unwanted side effect is the rich pickings available to those with malicious intentions. Increasingly, the sophisticated cyber attacker is able to exploit the rich array of public data to build detailed profiles of their adversaries in support of those intentions.
Machine Learning
Real-time, large-scale machine learning and predictive analytics infrastructure built on Hadoop
• Collaborative filtering and recommendation
• Classification and regression
• Clustering
Editor's Notes
Data is valuable both as an asset and to your customers. As the guardian of your customers’ data, you provide services using that data, such as bank accounts and online tax discs. Of course you need to defend that data on your customers’ behalf if you want to maintain their loyalty. This talk will explore how you can do that using Cloudera’s Enterprise Data Hub, but also how you can use this technology to play some offence, using its immense computational power to evaluate how your customers are being subjected to cyber attacks and how you can help them.
In the same way that data is indicative to business of purchase behaviour and intent, it is valuable to the bad guys, whether to damage reputation or simply to trade. The bad guys have the advantage of being able to aggregate from numerous data sources without worrying about any regulation other than getting caught. As businesses move their assets and knowledge capital online, these assets are increasingly spread throughout the supply chain. For large enterprises, protecting this supply chain is challenging, especially where it is outsourced.
Multi-tenant secure clusters running EDH could be the solution: resources are pooled to create a capability whereby all of the instrumentation and data assets are stored in the same data lake, or reservoir, partitioned by robust security.
Let’s take a look at some typical security layers that are used to protect these assets.
Cloudera Enterprise Data Hub provides enterprise-class security for Hadoop, specifically to enable complex and challenging regulatory workloads. It incorporates many upstream features from Intel’s Project Rhino, including encryption at rest and in motion with hardware-enhanced performance, better use of role-based access control, high levels of granularity such as cell-level access control in HBase, and end-to-end audit compliance.
YARN static and dynamic resource pools restrict resource utilization in a shared multi-tenant environment, thus contributing to the availability of the cluster
Encryption ensures the integrity and indeed the confidentiality of the data
All communications, including remote procedure calls between nodes, are authenticated with a valid Kerberos ticket. The KDC may feature a one-way trust with the corporate directory, or indeed be fully integrated using SSSD
Role-based access control to the underlying data facilitates multi-tenant (within the enterprise) access to data
Tracking the provenance of your data throughout the storage and processing chain is vital, particularly if that data is subject to compliance regulation such as PCI
Why you need Navigator:
• Lots of data landing in Cloudera Enterprise
– Huge quantities
– Many different sources, structured and unstructured
– Varying levels of sensitivity
• Many users working with the data
– Administrators and compliance officers
– Analysts and data scientists
– Business users
• Need to effectively control and consume data
– Get visibility and control over the environment
– Discover and explore data
Encryption in motion: SSL is enabled for services, with authenticated RPC calls on the cluster. The Key Trustee server can be integrated with existing HSMs so that the master encryption keys are both tamper-proof and revocable and work with existing key management policies. The access controls are process-based, which effectively prevents a root user accessing the unencrypted contents of a file: an important and valuable separation of duties.
Our design strategy is to tightly integrate different processing paradigms into the Hadoop system. Resources are pooled to enable different computation workloads, such as MapReduce and Impala, to utilize common infrastructure. Interactive SQL and batch processing, whether MapReduce, Spark, or stream processing such as Spark Streaming, are just more applications that you bring to your data. These are integrated with Hadoop’s existing security and resource management frameworks and are completely interoperable with existing data formats and processing engines such as MapReduce.
• One pool of data
– Storage platforms (HDFS & HBase)
– Open data formats (files & records)
– Shared across multiple processing frameworks
• One metadata model
– No synchronization of metadata between two different systems (analytical DBMS and Hadoop)
– Same metadata used by other components within Hadoop itself (Hive, Pig, Impala, etc.)
• One security framework
– Single model for all of Hadoop
– Doesn’t require “turning off” any portion of native Hadoop security
• One set of system resources
– One set of nodes: storage, CPU, memory
– One management console
– Integrated resource management
– Scale linearly as capacity or performance needs grow
The Enterprise Data Hub infrastructure can support an array of use cases that would otherwise be locked in expensive, limited-capability silos. Those use cases can be applied to the full data set more productively, at lower cost. As a result, the economics facilitate the overall capability to ask those bigger questions. These use cases apply across domains encompassing management, security, HR, and business intelligence.
Complex MapReduce jobs are often a chained series of tasks: Map, Reduce, Map, Map, Reduce, and so on. Apache Spark significantly simplifies the coding of these complex pipelines: with a common API for both batch and streaming, the programmer can explicitly write to disk at the optimal time, as in the sketch below.
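A minimal PySpark sketch of such a chained pipeline; the input path, the tab-separated field layout, and the threshold of 100 are illustrative assumptions:

```python
# Sketch of a chained pipeline that would be several MapReduce jobs;
# Spark keeps intermediate results in memory and writes out only at the end.
from pyspark import SparkContext

sc = SparkContext(appName="ChainedPipeline")

events = sc.textFile("hdfs:///data/events")           # map: read raw lines
parsed = events.map(lambda line: line.split("\t"))    # map: parse fields
by_user = (parsed.map(lambda f: (f[0], 1))            # map: key by user id
                 .reduceByKey(lambda a, b: a + b))    # reduce: count per user
by_user.cache()                                       # keep in memory for reuse

active = by_user.filter(lambda kv: kv[1] > 100)       # map: further transform
active.saveAsTextFile("hdfs:///out/active-users")     # explicit write, once
```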
Enterprises are increasingly using Hadoop and the economics of Big Data to drive efficiencies in the way they provide and consume IT services. Big Data economics allow the entirety of the structured management metrics from IT infrastructure to be combined with the unstructured supporting commentary.
This allows for new types of exploitation, such as machine learning and predictive analysis. The innovation begins with continuously ingesting the metrics and supporting commentary that describe current performance. Discovery evaluates the historical patterns of performance that build up over time, using machine learning to construct a model. These patterns in turn provide insights into the predictions that those signals often illustrate. Cases include variations in the efficiency of manufacturers’ disks for variables such as power consumption, developer team code performance, and the impact of training and certification. All of which enables further innovations and gains based on facts.
Flume
A resilient framework for delivering event data to the Hadoop cluster, built from Sources, Channels, and Sinks, as sketched below.
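As a rough conceptual sketch only, not Flume’s actual API, the Source/Channel/Sink decoupling can be pictured as producer and consumer threads around a bounded queue:

```python
# Conceptual model of Flume's Source -> Channel -> Sink decoupling,
# using a bounded queue as the channel. Illustration only, not Flume's API.
import queue
import threading

channel = queue.Queue(maxsize=10000)   # the channel buffers events

def source():
    """Reads an event stream and puts events on the channel."""
    for i in range(100):
        channel.put({"body": f"event {i}"})   # blocks if the channel is full

def sink():
    """Drains the channel and delivers events downstream (e.g. to HDFS)."""
    for _ in range(100):
        event = channel.get()
        print("deliver:", event["body"])
        channel.task_done()

threading.Thread(target=source).start()
threading.Thread(target=sink).start()
channel.join()   # events wait safely in the channel if the sink lags behind
```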
Kite is a set of libraries, tools, and features to build Hadoop applications
Morphlines provides configuration-driven tools that can extract facets using interceptors on the ingestion pipeline, enriching records with metadata
In this sample, all of the Apache web server logs are filtered for HTTP 408 errors. Faceting by country using a GeoIP lookup helps identify the source of the DDoS.
Slowloris is an old DDoS trick whereby a web client very slowly makes a connection to the web server; assuming Apache is patched to time out such connections, Slowloris is revealed by filtering on the 408 errors.
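A hedged Python sketch of that analysis; the log path is illustrative and lookup_country() is a hypothetical stand-in for a real GeoIP database query (e.g. a MaxMind database):

```python
# Sketch: filter Apache access logs for HTTP 408 (request timeout) and
# facet the offending client IPs by country.
import re
from collections import Counter

# Common Log Format: ip ident user [date] "request" status ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

def lookup_country(ip):
    # Hypothetical stand-in for a real GeoIP database lookup.
    return "GB" if ip.startswith("81.") else "??"

counts = Counter()
with open("access.log") as log:
    for line in log:
        m = LOG_LINE.match(line)
        if m and m.group(2) == "408":          # Slowloris clients time out with 408
            counts[lookup_country(m.group(1))] += 1

for country, n in counts.most_common():
    print(country, n)
```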
Oryx can continuously build models from a stream of data at large scale using Apache Hadoop. It also serves queries of those models in real time via an HTTP REST API, and can approximately update models in response to new streaming data. This two-tier design, comprising the Computation Layer and the Serving Layer respectively, implements a lambda architecture. Collaborative filtering works like “people who searched for this also searched for that”. Classification and regression are forms of supervised learning, where a value is predicted for new inputs based on known values for previous inputs; classification is often used for spam filters. Clustering groups data using algorithms such as k-means based on common features.
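As a minimal sketch of collaborative filtering, here is alternating least squares (ALS) from Spark MLlib; the ratings path and its user,item,rating layout are assumptions:

```python
# Sketch: collaborative filtering with Spark MLlib's ALS recommender.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="CollabFilter")
raw = sc.textFile("hdfs:///data/ratings.csv")          # assumed: user,item,rating
ratings = raw.map(lambda l: l.split(",")) \
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

model = ALS.train(ratings, rank=10, iterations=10)     # factorize user-item matrix
print(model.recommendProducts(42, 5))                  # top-5 items for user 42
```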
Vectorising uses TF-IDF, term frequency–inverse document frequency, which infers how important a word might be in a document. The resulting vectors can then be classified using an algorithm such as naive Bayes.
TF-IDF is useful to extract as a feature that can then be clustered using k-means across a corpus of documents. It is often used by search engines to score and rank documents against a query. For example, a stream of data from a Twitter channel sharing a hashtag, as in the sketch below.
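A minimal Spark MLlib sketch of that pipeline: hash documents into term-frequency vectors, weight with IDF, then cluster with k-means. The input path and parameters are illustrative; for classification instead, the same TF-IDF vectors would be wrapped in LabeledPoints and passed to NaiveBayes.train:

```python
# Sketch: TF-IDF vectorization of a document corpus, then k-means clustering.
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="TfIdfKMeans")
docs = sc.textFile("hdfs:///data/tweets").map(lambda line: line.split())

tf = HashingTF(numFeatures=10000).transform(docs)     # term frequencies per doc
tf.cache()                                            # IDF makes two passes
tfidf = IDF().fit(tf).transform(tf)                   # weight by inverse doc freq

model = KMeans.train(tfidf, k=8, maxIterations=20)    # group similar documents
print(model.clusterCenters[0])                        # inspect one cluster centre
```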
The choices are Oryx 1 (MapReduce-based), Oryx 2 (Spark-based), and Spark MLlib
Doing so in memory on Spark is good for iterative algorithms, avoiding the need to materialize the data, and for jobs such as Monte Carlo simulations
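The classic illustration is a Monte Carlo estimate of pi, where every trial stays in memory and nothing is materialized to disk:

```python
# Sketch: Monte Carlo estimation of pi on Spark. Each trial is independent,
# so the work parallelizes trivially and never touches disk.
import random
from pyspark import SparkContext

sc = SparkContext(appName="MonteCarloPi")
n = 10_000_000

def inside(_):
    # Sample a point in the unit square; test if it falls in the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

hits = sc.parallelize(range(n), 100).filter(inside).count()
print("pi ~=", 4.0 * hits / n)
```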