Learn how you can use Cloudera Impala to:
- Operate with all data in your domain
- Address cyber security analysis and forensics needs
- Combat fraud, waste, and abuse
Cloudera Federal Forum 2014: Hadoop-Powered Solutions for Cybersecurity (Cloudera, Inc.)
Chief Architect of Cloudera Government Solutions, Joey Echeverria, shares knowledge about Hadoop cybersecurity and the pieces of Cloudera's Enterprise Data Hub that address cybersecurity.
Data is being generated at a feverish pace and many businesses want all of it at their disposal to solve complex strategic problems. As decision making moves to real time, enterprises need data ready for analysis immediately. Sean Anderson and Amandeep Khurana will discuss common pipeline trends in modern streaming architectures, Hadoop components that enable streaming capabilities, and popular use cases that are enabling the world of IoT and real-time data science.
How does SolrCloud ensure that replicated data remains consistent? How does Solr avoid data loss when hardware inevitably fails? In this talk, we will cover how Solr addresses failures and what recovery steps the cluster can automatically perform.
Using Hadoop to Drive Down Fraud for Telcos (Cloudera, Inc.)
Communication Service Providers (CSPs) lose around $38 Billion to fraud every year. Check out this webinar to learn more about the Cloudera - Argyle Data real-time fraud analytics platform and how Telcos can utilize Apache Hadoop to drive down fraud.
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production (Cloudera, Inc.)
It’s no secret that Apache Spark is becoming the successor to MapReduce for data processing in Hadoop. With its easy development, flexible API, and performance benefits, Spark is a powerful data processing engine that has quickly gained popularity within the community. On the other hand, Hive continues to be the most widely used data warehouse/ETL engine, with large-scale adoption across enterprises. Therefore, it’s imperative to enable Spark as the underlying execution engine for Hive so that existing and future Hive workloads can seamlessly leverage the advantages of Spark.
With the recent release of Cloudera 5.7, we have delivered on this goal by adding support for Hive-on-Spark. Data engineers and ETL developers can now transition their Hive workloads from MR to Spark seamlessly, thereby benefiting from the advantages of Spark without any disruption on their end.
Join Santosh Kumar, Senior Product Manager at Cloudera, and Rui Li, Apache Hive committer and engineer at Intel, as we discuss:
An Introduction to Spark and its advantages over MR
An introduction to Hive-on-Spark: Goals and Design Principles
Migrating to HoS and a live demo
Configuring and tuning for batch workloads
What’s next for both tools
3 Things to Learn:
-How data is driving digital transformation to help businesses innovate rapidly
-How Choice Hotels (one of the largest hoteliers) is using Cloudera Enterprise to gain meaningful insights that drive their business
-How Choice Hotels has transformed business through innovative use of Apache Hadoop, Cloudera Enterprise, and deployment in the cloud — from developing customer experiences to meeting IT compliance requirements
Risk Management for Data: Secured and Governed (Cloudera, Inc.)
Cloudera Tech Day Presentation by Eddie Garcia, Chief Security Architect, Cloudera. Protecting enterprise data is an increasingly complex challenge given the diversity and sophistication of threat actors and their cyber-tactics. In this session, participants will hear a comprehensive introduction to Hadoop Security, including the “three A’s” for secure operating environments: Authentication, Authorization, and Audit. In addition, the presenter will cover strategies to orchestrate data security, encryption, and compliance, and will explain the Cloudera Security Maturity Model for Hadoop. Attendees will leave with a greater understanding of how effective INFOSEC relies on an enterprise big data governance and risk management approach.
Data Science and Machine Learning for the Enterprise (Cloudera, Inc.)
Overview of Machine Learning and how the Cloudera Data Science Workbench provides full access to data while supporting IT SLAs. The presentation includes details on Fast Forward Labs and The Value of Interpretability in Models.
Cloudera Altus: Big Data in the Cloud Made Easy (Cloudera, Inc.)
Cloudera Altus makes it easier for data engineers, ETL developers, and anyone who regularly works with raw data to process that data in the cloud efficiently and cost effectively. In this webinar we introduce our new platform-as-a-service offering and explore challenges associated with data processing in the cloud today, how Altus abstracts cluster overhead to deliver easy, efficient data processing, and unique features and benefits of Cloudera Altus.
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... (Cloudera, Inc.)
You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
But there can be pitfalls: dplyr works differently with different data sources—and those differences can bite you if you don’t know what you’re doing.
Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about sparklyr (from RStudio) and the package implyr (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. And, he will solve mysteries:
Do I need to know SQL to use dplyr?
When is a “tbl” not a “tibble”?
Why is 1 not always equal to 1?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
3 things to learn:
Do I need to know SQL to use dplyr?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
Unlock Hadoop Success with Cloudera Navigator Optimizer (Cloudera, Inc.)
Cloudera Navigator Optimizer analyzes existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop.
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond (Cloudera, Inc.)
Federal organizations increasingly are focused on creating environments that enable more data-driven decisions. Yet ensuring that all data is considered and is current, complete, and accurate is a tall order for most. To make data analytics meaningful to support real-world transformation, agency staff need business tools that provide user-friendly dashboards, on-demand reporting, and methods to manage efficiently the rise of voluminous and varied data sets and types commonly associated with big data. In most cases, existing systems are insufficient to support these requirements. Enter the enterprise data hub (EDH), a software architecture specifically designed to be a unified platform that can economically store unlimited data and enable diverse access to it at scale. Plan to attend this discussion to understand the key considerations to making an EDH the architectural center of your agency’s modern data strategy.
A Community Approach to Fighting Cyber Threats (Cloudera, Inc.)
3 Things to Learn About:
*Infinitely scale data storage, access, and machine learning
*Provide community-defined open data models for complete enterprise visibility
*Open up application flexibility while building on a future-proofed architecture
Seeking Cybersecurity--Strategies to Protect the Data (Cloudera, Inc.)
Agency professionals are responsible for protecting the data they collect, store, analyze, and share. While Hadoop has been especially popular for data analytics given its ability to handle volume, velocity, and variety of data, this flexibility and scale can present challenges for securing and governing the data. Plan to attend this session to understand the Hadoop Security Maturity Model, from the fundamentals to the latest developments, and how to ensure your data analytics cluster complies with the latest INFOSEC standards and audit requirements. Bring your experience and your questions to this informative and interactive cybersecurity session.
How to Build Continuous Ingestion for the Internet of Things (Cloudera, Inc.)
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IoT data flows using StreamSets and Cloudera infrastructure.
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data... (Cloudera, Inc.)
Cloudera Enterprise can be used as an adaptive, high-performance analytic database, complementing existing data warehouses by relieving the pressure of growing numbers of ETL jobs and BI analytics. But where do you get started when developing your offload strategy? How can you identify which workloads are the best fit for which system? And once you’re up and running, how can you constantly adapt to Hadoop’s changing data needs?
Cloudera Navigator Optimizer eases the path for moving the right workloads to Hadoop and then actively manages data, allowing you to take advantage of Hadoop’s benefits. Now generally available with the recent release of Cloudera 5.8 and a unique part of Cloudera’s analytic database solution, Navigator Optimizer gives you the workload visibility and assessments to build a predictable offload plan, adapt to evolving data and workload demands, and optimize query performance for Hadoop technologies.
3 Things to Learn:
Join Ewa Ding, Senior Product Manager at Cloudera, as she discusses:
-An overview of Cloudera Navigator Optimizer and its key features
-A live demo and key use cases of this web-based tool
-What’s next for active data optimization in Hadoop
Cloudera Tech Day Presentation by Eva Andreasson, Director Product Management, Cloudera.
Text-based search recently has become a critical part of the Hadoop stack, and has emerged as one of the highest-performing solutions for big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for enabling next-generation big data analytics applications.
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie... (Cloudera, Inc.)
Hadoop was the first software to permit affordable use of petabytes. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition provides an example of an evolutionary pattern in the data ecosystem that gives it long-term strength.
Discover the origins of big data, explore existing and new projects and their common use cases, and learn how you can modernize your architecture using data analytics, data operations, data engineering, and data science.
Big Data Fundamentals is your prerequisite to building a modern platform for machine learning and analytics optimized for the cloud.
We’ll close out with a live Q&A with some of our technical experts as well.
Stretch your brain with a packed agenda:
Open source software
Data storage
Data ingestion
Data analytics
Data engineering
IoT and life after Lambda architectures
Data science
Cybersecurity
Cluster management
Big data in the cloud
Success stories
Topics include: The transformative value of real-time data and analytics, and current barriers to adoption. The importance of an end-to-end solution for data-in-motion that includes ingestion, processing, and serving. Apache Kudu’s role in simplifying real-time architectures.
Big data journey to the cloud rohit pujari 5.30.18 (Cloudera, Inc.)
We hope this session was valuable in teaching you more about Cloudera Enterprise on AWS, and how fast and easy it is to deploy a modern data management platform—in your cloud and on your terms.
Data Science at Scale Using Apache Spark and Apache Hadoop (Cloudera, Inc.)
Learn about the skills and tools a data scientist needs and how to start training to be one.
There's so much noise about what a data scientist is or isn't that it can be challenging to identify the skills needed to start training a team or becoming one yourself. What exactly is a data scientist and where do you start?
Cloudera's Director of Data Science, Sean Owen, will start by walking through the different skills a data scientist should have and why businesses need them. Afterwards, Tom Wheeler, Cloudera's Principal Curriculum Developer, will introduce the latest data science course developed by Cloudera University, designed to help people take their first steps toward becoming a data scientist.
Self-service Big Data Analytics on Microsoft Azure (Cloudera, Inc.)
In this presentation Microsoft will join Cloudera to introduce a new Platform-as-a-Service (PaaS) offering that helps data engineers use on-demand cloud infrastructure to speed the creation and operation of data pipelines that power sophisticated, data-driven applications - without onerous administration.
Multi-Tenant Operations with Cloudera 5.7 & BT (Cloudera, Inc.)
One benefit of Apache Hadoop is the ability to power multiple workloads, across many different users and departments, all within a single, shared cluster. Hear how BT is doing this today and learn about new features in Cloudera Manager to provide better visibility for multi-tenant operations.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Lightning Fast Analytics with Hive LLAP and Druid (DataWorks Summit)
Cox Communications, one of the largest network providers in the U.S., is primarily focused on ensuring network security and providing better service to customers, including:
• Real-time monitoring of IP security traffic to identify and alert on unusual network activity across interfaces within an organization
• Equipping the security team with capabilities to determine the source and destination of traffic, class of service, and the causes of congestion from NetFlow data
Challenges:
Network security data includes highly granular streaming data. The major challenge lies in having a unified platform to perform data cleansing, transformation, analytics, and reporting on these huge streaming datasets. As network traffic grows, the associated data grows exponentially. A scalable framework is needed to handle these datasets and derive useful information from the data. Along with data processing, data retrieval also plays a major role in better analysis. Data processing had been done in daily batches using manual Python scripts and custom data structures that were specific to individual use cases. There was a need for a more generic and unified framework to provide an automated, real-time, end-to-end solution that delivers high-performing, more granular business results.
Solution:
Automating this process offers opportunities on several fronts, notably consistency, repeatability, and modernization of OLAP analytics on an enterprise big data platform. Reports can be generated more easily and quickly with the underlying OLAP engine.
• A modern big data platform provides the necessary tools and infrastructure to land, cleanse, process, and enrich real-time streaming data using ecosystem components like Spark, Kafka, and Hive
• Impressively faster OLAP analytics using Hive LLAP and Druid Integration
• Simpler and faster reporting using Superset
All of the necessary components are under one roof on the Hortonworks Hadoop Platform.
An end-to-end solution using the big data platform produced faster, repeatable results with sub-second query response times.
Value added by the above solution:
• Deliver ultra-fast SQL analytics that can be consumed from a BI tool by the security engineering team to get accelerated business results
• Opportunity for business users to explore and visualize real-time streaming datasets, with integration for various data sources, and build dashboards for different slices
• Capability to run BI queries in just milliseconds over a 1 TB dataset
• Highly granular permission model on security datasets that allows intricate rules on dataset accessibility
Cloudera Altus: Big Data in the Cloud Made EasyCloudera, Inc.
Cloudera Altus makes it easier for data engineers, ETL developers, and anyone who regularly works with raw data to process that data in the cloud efficiently and cost effectively. In this webinar we introduce our new platform-as-a-service offering and explore challenges associated with data processing in the cloud today, how Altus abstracts cluster overhead to deliver easy, efficient data processing, and unique features and benefits of Cloudera Altus.
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...Cloudera, Inc.
You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
But there can be pitfalls: dplyr works differently with different data sources—and those differences can bite you if you don’t know what you’re doing.
Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about sparklyr (from RStudio) and the package implyr (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. And, he will solve mysteries:
Do I need to know SQL to use dplyr?
When is a “tbl” not a “tibble”?
Why is 1 not always equal to 1?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
3 things to learn:
Do I need to know SQL to use dplyr?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
Cloudera Navigator Optimizer analyzes existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop.
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondCloudera, Inc.
Federal organizations increasingly are focused on creating environments that enable more data-driven decisions. Yet ensuring that all data is considered and is current, complete, and accurate is a tall order for most. To make data analytics meaningful to support real-world transformation, agency staff need business tools that provide user-friendly dashboards, on-demand reporting, and methods to manage efficiently the rise of voluminous and varied data sets and types commonly associated with big data. In most cases, existing systems are insufficient to support these requirements. Enter the enterprise data hub (EDH), a software architecture specifically designed to be a unified platform that can economically store unlimited data and enable diverse access to it at scale. Plan to attend this discussion to understand the key considerations to making an EDH the architectural center of your agency’s modern data strategy.
A Community Approach to Fighting Cyber ThreatsCloudera, Inc.
3 Things to Learn About:
*Infinitely scale data storage, access, and machine learning
*Provide community defined open data models for complete enterprise visibility
*Open up application flexibility while building on a future proofed architecture
Seeking Cybersecurity--Strategies to Protect the DataCloudera, Inc.
Agency professionals are responsible for protecting the data they collect, store, analyze, and share. While Hadoop has been especially popular for data analytics given its ability to handle volume, velocity, and variety of data, this flexibility and scale can present challenges for securing and governing the data. Plan to attend this session to understand the Hadoop Security Maturity Model—from the fundamentals to the latest developments--and how to ensure your data analytics cluster complies with the latest INFOSEC standards and audit requirements. Bring your experience and your questions to this informative and interactive cybersecurity session.
How to Build Continuous Ingestion for the Internet of ThingsCloudera, Inc.
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IOT data flows using StreamSets and Cloudera infrastructure.
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Cloudera, Inc.
Cloudera Enterprise can be used as an adaptive, high-performance analytic database, complementing existing data warehouses by relieving the pressure of growing numbers of ETL jobs and BI analytics. But where do you get started when developing your offload strategy? How can you identify which workloads are the best fit for which system? And once you’re up and running, how can you constantly adapt to Hadoop’s changing data needs?
Cloudera Navigator Optimizer eases the path for moving the right workloads to Hadoop and then actively manages data allowing you to take advantage of Hadoop’s benefits. Now generally available with the recent release of Cloudera 5.8 and a unique part of Cloudera’s analytic database solution, Navigator Optimizer gives you the workload visibility and assessments to build a predictable offload plan, adapt to evolving data and workload demands, and optimize query performance for Hadoop technologies
3 Things to Learn:
Join Ewa Ding, Senior Product Manager at Cloudera, as she discusses:
-An overview of Cloudera Navigator Optimizer and its key features
-A live demo and key use cases of this web-based tool
-What’s next for active data optimization in Hadoop
Cloudera Tech Day Presentation by Eva Andreasson, Director Product Management, Cloudera.
Text-based search recently has become a critical part of the Hadoop stack, and has emerged as one of the highest-performing solutions for big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for enabling next-generation big data analytics applications.
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Cloudera, Inc.
Hadoop was the first software to permit affordable use of petabytes. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition provides an example of an evolutionary pattern in the data ecosystem that gives it long-term strength.
Discover the origins of big data, discuss existing and new projects, share common use cases for those projects, and explain how you can modernize your architecture using data analytics, data operations, data engineering and data science.
Big Data Fundamentals is your prerequisite to building a modern platform for machine learning and analytics optimized for the cloud.
We’ll close out with a live Q&A with some of our technical experts as well.
Stretch your brain with a packed agenda:
Open source software
Data storage
Data ingestion
Data analytics
Data engineering
IoT and life after Lambda architectures
Data science
Cybersecurity
Cluster management
Big data in the cloud
Success stories
Topics including: The transformative value of real-time data and analytics, and current barriers to adoption. The importance of an end-to-end solution for data-in-motion that includes ingestion, processing, and serving. Apache Kudu’s role in simplifying real-time architectures.
Big data journey to the cloud rohit pujari 5.30.18Cloudera, Inc.
We hope this session was valuable in teaching you more about Cloudera Enterprise on AWS, and how fast and easy it is to deploy a modern data management platform—in your cloud and on your terms.
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
Learn about the skills and tools a data scientist needs and how to start training to be one.
There's so much noise about what a data scientist is or isn't that it can be challenging to identify the skills needed to start training a team or becoming one yourself. What exactly is a data scientist and where do you start?
Cloudera's Director of Data Science, Sean Owen, will start by walking through the different skills data scientist should have and why businesses need them. Afterwards, Tom Wheeler, Cloudera's Principal Curriculum Developer, will introduce the latest data science course developed by Cloudera University designed to help people take their first steps to becoming a data scientist.
Self-service Big Data Analytics on Microsoft AzureCloudera, Inc.
In this presentation Microsoft will join Cloudera to introduce a new Platform-as-a-Service (PaaS) offering that helps data engineers use on-demand cloud infrastructure to speed the creation and operation of data pipelines that power sophisticated, data-driven applications - without onerous administration.
Multi-Tenant Operations with Cloudera 5.7 & BTCloudera, Inc.
One benefit of Apache Hadoop is the ability to power multiple workloads, across many different users and departments, all within a single, shared cluster. Hear how BT is doing this today and learn about new features in Cloudera Manager to provide better visibility for multi-tenant operations.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Lightning Fast Analytics with Hive LLAP and DruidDataWorks Summit
Cox Communications, one of the largest network providers in the U.S., is primarily focused on ensuring network security and providing better service to customers, including:
• Real-time monitoring of IP security traffic to identify and alert on unusual network activity across interfaces within an organization
• Equipping the security team with capabilities to determine the source and destination of traffic, the class of service, and the causes of congestion from NetFlow data
Challenges:
Network security data includes highly granular streaming data. The major challenge lies in having a unified platform to perform data cleansing, transformation, analytics, and reporting on these huge streaming datasets. As network traffic grows, the associated data grows exponentially. A scalable framework is needed to handle these datasets and derive useful information from them. Along with data processing, data retrieval also plays a major role in better analysis. Until now, data processing was done in daily batches using manual Python scripts and custom data structures that were specific to individual use cases. There was a need for a more generic, unified framework to provide an automated, real-time, end-to-end solution that delivers high-performing, more granular business results.
Solution:
Automating this process opens opportunities on several fronts, notably consistency, repeatability, and modernization of OLAP analytics on the enterprise big data platform. Reports can be generated more easily and quickly with the underlying OLAP engine.
• A modern big data platform provides the tools and infrastructure to land, cleanse, process, and enrich real-time streaming data using ecosystem components such as Spark, Kafka, and Hive
• Much faster OLAP analytics using Hive LLAP and Druid integration
• Simpler and faster reporting using Superset
All of the necessary components are available under one roof on the Hortonworks Hadoop platform.
An end-to-end solution on the big data platform produced faster, repeatable results with sub-second query response times.
Value added by the above solution:
• Deliver ultra-fast SQL analytics that can be consumed from the BI tool by the security engineering team to get accelerated business results
• Opportunity for business users to explore and visualize real-time streaming datasets, with integration for various data sources, and build dashboards for different slices
• Capability to run BI queries in just milliseconds over a 1 TB dataset
• Highly granular permission model on security datasets that allows intricate rules on dataset accessibility
The document provides an overview of the key security challenges in Big Data (Apache Hadoop) systems, and showcases the solutions used by Hortonworks Distribution to solve these security challenges.
Big Data is an increasingly powerful enterprise asset, and this talk will explore the relationship between big data and cyber security and how we preserve privacy whilst exploiting the advantages of data collection and processing. Big Data technologies provide both governments and corporations powerful tools to offer more efficient and personalized services. The rapid adoption of these technologies has of course created tremendous social benefits. Unfortunately, an unwanted side effect is the potential rich pickings available to those with malicious intentions. Increasingly, the sophisticated cyber attacker is able to exploit the rich array of public data to build detailed profiles of their adversaries to support their malicious intentions.
Hadoop Security Features that make your risk officer happyAnurag Shrivastava
This talk was delivered by Anurag Shrivastava at Hadoop Summit 2015 Brussels. It covers how Apache Ranger, Apache Sentry, Apache Knox and Project Rhino can help you pass IT risk assessment in Hadoop projects.
Preparing for the Cybersecurity RenaissanceCloudera, Inc.
We are in the midst of a fundamental shift in the way in which organizations protect themselves from the modern adversary.
Traditional rules based cybersecurity applications of the past are not able to protect organizations in the new mobile, social, and hyper-connected world they now operate within. However, the convergence of big data technology, analytic advancements, and a variety of other factors have sparked a cybersecurity renaissance that will forever change the way in which organizations protect themselves.
Join Rocky DeStefano, Cloudera's Cybersecurity subject matter expert, as he explores how modern organizations are protecting themselves from more frequent, sophisticated attacks.
During this webinar you will learn about:
The current challenges cybersecurity professionals are facing today
How big data technologies are extending the capabilities of cybersecurity applications
Cloudera customers that are future proofing their cybersecurity posture with Cloudera’s next generation data and analytics management system
Get Started with Cloudera’s Cyber SolutionCloudera, Inc.
Cloudera empowers cybersecurity innovators to proactively secure the enterprise by accelerating threat detection, investigation, and response through machine learning and complete enterprise visibility. Cloudera’s cybersecurity solution, based on Apache Spot, enables anomaly detection, behavior analytics, and comprehensive access across all enterprise data using an open, scalable platform. But what’s the easiest way to get started?
Join Cloudera, StreamSets, and Arcadia Data as we show you first hand how we have made it easier to get your first use case up and running. During this session you will learn:
Signs you need Cloudera’s cybersecurity solution
How StreamSets can help increase enterprise visibility
Providing your security analyst the right context at the right time with modern visualizations
Equinix Big Data Platform and Cassandra - A view into the journeyPraveen Kumar
The story of building a big data platform at Equinix to cater to a number of use cases. It explains the journey and the selection of Cassandra as the NoSQL solution at the heart of the platform. Storm, Flume, AMQ, Drools, and Solr play important roles in the platform, which processes large amounts of data in real time.
Comprehensive Security for the Enterprise II: Guarding the Perimeter and Cont...Cloudera, Inc.
One of the benefits of Hadoop is that it easily allows for multiple entry points both for data flow and user access. Here we discuss how Cloudera allows you to preserve the agility of having multiple entry points while also providing strong, easy to manage authentication. Additionally, we discuss how Cloudera provides unified authorization to easily control access for multiple data processing engines.
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks
As more data is imported into Hadoop Data Lakes, how can we best secure sensitive data? Recording is at: https://www.brighttalk.com/webcast/9573/171957
What security options are available and what kind of best practices should be implemented? Join our two speakers as they discuss securing HDP data lakes to leverage security in Hadoop without sacrificing usability. Presenters: Vincent Lam, Protegrity - Syed Mahmood, Hortonworks.
You’ll learn about:
· The 5 Pillars of Security for Hadoop
· Open Source HDP Security
· How Hortonworks leverages Protegrity to jointly offer the most robust Hadoop protection available
· The benefits and differences of data protection including tokenization, encryption, and masking
· Leveraging consistent security across Hadoop and beyond for protection of data across its lifecycle
VMworld 2013
Chris Greer, FedEx
Richard McDougall, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
This annual program recognizes organizations who are moving swiftly towards the future and building innovative solutions by making what was impossible yesterday, possible today.
The winning organizations' implementations demonstrate outstanding achievements in fulfilling their mission, technical advancement, and overall impact.
The 2021 Data Impact Awards recognize organizations' achievements with the Cloudera Data Platform in seven categories:
Data Lifecycle Connection
Data for Enterprise AI
Cloud Innovation
Security & Governance Leadership
People First
Data for Good
Industry Transformation
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Combat Cyber Threats with Cloudera Impala & Apache Hadoop
1. Combat Cyber Threats
with Cloudera Impala & Apache Hadoop
Justin Erickson | Director, Product Management, Cloudera
Wayne Wheeles | Analytic, Infrastructure and Enrichment Developer Cyber
Security, Six3 Systems
July 2013
2. Agenda
What’s new in Impala?
• Impala recap
• Impala 1.1
• Authorization with Sentry
Cyber security with Impala
• Cyber security demo overview
• Working with WebProxy Data
• Working with Netflow Data
• IDS Amplification and Correlation “holy grail use case”
• Discussion and questions
3. Cloudera Impala
Interactive SQL for Hadoop
Responses in seconds
ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
Purpose-built for low-latency queries
Separate runtime from MapReduce
Designed as part of the Hadoop ecosystem
Open Source
Apache-licensed
4. Benefits of Impala
More & Faster Value from “Big Data”
Interactive BI/analytics experience via SQL
No delays from data migration
Flexibility
Query across existing data
Select best-fit file formats (Parquet, Avro, etc.)
Run multiple frameworks on the same data at the same time
Cost Efficiency
Reduce movement, duplicate storage & compute
10% to 1% of the cost of an analytic DBMS
Full Fidelity Analysis
No loss from aggregations or fixed schemas
6. Previous State of Authorization
Insecure Advisory Authorization
Users can grant themselves permissions
Intended to prevent accidental deletion of data
Problem: Doesn’t guard against malicious users
HDFS Impersonation
Data is protected at the file level by HDFS permissions
Problem: File-level not granular enough
Problem: Not role-based
Two Sub-Optimal Choices for SQL on Hadoop
7. Sentry with CDH4.3 Hive and Impala 1.1
Secure Authorization
Ability to control access to data and/or privileges on data for authenticated users
Fine-Grained Authorization
Ability to give users access to a subset of data in a database
Role-Based Authorization
Ability to create/apply templatized privileges based on functional roles
Multi-Tenant Administration
Ability for central admin group to empower lower-level admins to manage security for each database/schema
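The fine-grained, role-based model described on this slide can be illustrated with a small toy example. The sketch below is not Sentry's API or policy format; the group, role, database, and table names are all hypothetical, and it only shows the general idea of mapping groups to roles and roles to privileges on a subset of data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    database: str
    table: str   # "*" means every table in the database
    action: str  # e.g. "select" or "insert"; "*" means any action


# Hypothetical roles, groups, and table names, for illustration only.
ROLE_PRIVILEGES = {
    "netflow_analyst": {Privilege("cyber", "netflow", "select")},
    "cyber_admin": {Privilege("cyber", "*", "*")},
}
GROUP_ROLES = {
    "analysts": {"netflow_analyst"},
    "secops_admins": {"cyber_admin"},
}


def is_allowed(groups, database, table, action):
    """Return True if any role held by any of the groups grants the request."""
    for group in groups:
        for role in GROUP_ROLES.get(group, ()):
            for priv in ROLE_PRIVILEGES.get(role, ()):
                if (priv.database == database
                        and priv.table in ("*", table)
                        and priv.action in ("*", action)):
                    return True
    return False
```

In this toy model an analyst can read only the netflow table, while the per-database admin role covers every table and action in that database, which mirrors the "subset of data" and "multi-tenant administration" points above.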
8. Part of an overall infosec landscape
Perimeter
Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation
Data
Protecting data in the cluster from unauthorized visibility
Technical Concepts: Encryption, Data masking
Access
Defining what users and applications can do with data
Technical Concepts: Permissions, Authorization
Visibility
Reporting on where data came from and how it’s being used
Technical Concepts: Auditing, Lineage
Sentry, Kerberos | Oozie | Knox, Cloudera Navigator, Certified Partners
Available 7/23
9. Agenda – Cyber security with Impala
What’s new in Impala?
• Impala recap
• Impala 1.1
• Authorization with Sentry
Cyber security with Impala
• Cyber security demo overview
• Working with WebProxy Data
• Working with Netflow Data
• IDS Amplification and Correlation “holy grail use case”
• Discussion and questions
10. Impala Mission Demonstration Platform
[Diagram labels: Application Server; Cloudera - CDH 4 Cluster (sherpa4, sherpa3, sherpa2, sherpa1); Organization Network; Gateway to Internet; SENSOR; Netflow; WebProxy; IDS]
Service lists shown on the cluster nodes:
• Cloudera Manager, HDFS, Impala, HBASE, MR, HIVE
• HDFS, Impala, HBASE, MR, HIVE
• HDFS (NN), Impala (State Store), HBASE (RS), MR, HUE, Oozie, Zookeeper, HIVE
11. Demo Platform Data Sets
Webinar Data Sets
• Netflow Data
• The term flow refers to a single data flow connection between two hosts, defined uniquely by its five-tuple.
• http://tools.netsa.cert.org/silk/
• IDS/IPS Data
• A device or software application that monitors network or system activities for malicious activities or policy violations and produces reports to a management station
• http://www.snort.org
• WebProxy Data
• Web proxy data for requests made by users within the corporate domain.
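The five-tuple mentioned for Netflow data is conventionally the source address, source port, destination address, destination port, and protocol. As a minimal sketch of how that key identifies a flow, the example below groups packet records by five-tuple and sums their sizes; the record layout and sample addresses are assumptions for illustration, not the demo's actual schema.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def bytes_per_flow(packets):
    """Sum packet sizes per flow, where a flow is identified by its five-tuple."""
    totals = defaultdict(int)
    for src_ip, src_port, dst_ip, dst_port, protocol, size in packets:
        totals[FiveTuple(src_ip, src_port, dst_ip, dst_port, protocol)] += size
    return dict(totals)


# Hypothetical packet records: (src_ip, src_port, dst_ip, dst_port, protocol, size)
packets = [
    ("10.0.0.5", 51514, "203.0.113.9", 443, "tcp", 1200),
    ("10.0.0.5", 51514, "203.0.113.9", 443, "tcp", 800),
    ("10.0.0.7", 40000, "203.0.113.9", 53, "udp", 90),
]
flows = bytes_per_flow(packets)
```

The first two records share all five fields, so they collapse into one flow; the third differs in source, port, and protocol and forms its own flow.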
Enrichment Data Sets
• Geographic enrichment
• Geo-location information of addresses
• http://dev.maxmind.com/
• Blacklist Information
• List of addresses identified as
potential threats
• http://www.autoshun.org/
• Whitelist Information
• Addresses known to be located within the
corporate network
• Statistical Cubes
• Cubes built for the purpose of providing
statistical amplification for analysis
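One way the blacklist and whitelist enrichment could be applied to a flow record is sketched below in Python. The list contents, the tag names, and the `classify` and `enrich` helpers are hypothetical; in the demo the enrichment data would come from the sources listed above rather than hard-coded values.

```python
import ipaddress

# Hypothetical enrichment data; real lists would be loaded from files.
BLACKLIST = {"203.0.113.9"}
WHITELIST_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def classify(address):
    """Tag an address as internal, blacklisted, or unknown."""
    ip = ipaddress.ip_address(address)
    if any(ip in net for net in WHITELIST_NETWORKS):
        return "internal"
    if address in BLACKLIST:
        return "blacklisted"
    return "unknown"


def enrich(flow):
    """Return a copy of a flow record with a tag for each endpoint."""
    return {
        **flow,
        "src_tag": classify(flow["src_ip"]),
        "dst_tag": classify(flow["dst_ip"]),
    }


flow = {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.9", "bytes": 1600}
enriched = enrich(flow)
print(enriched)
```

Tagging both endpoints makes it easy to filter for the interesting case, an internal host talking to a blacklisted address, with a single predicate.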
13. Why Impala for Cyber Security?
Cloudera Impala and HDFS are a great choice for cyber
security:
• Offers one powerful and secure platform for
structured and unstructured data.
• Uniquely provides the capability to store large
amounts of data at an acceptable price point.
• Sentry provides even greater protection for your
cyber security data.
14. Thank You
• Ask questions on the Q&A tab
• Recording will be available
at cloudera.com
• After webinar, inquire at:
info@cloudera.com
• Contact info:
Email:
sherpasurfing@gmail.com
impala-user@cloudera.org
Twitter:
@WayneWheeles
@JustinErickson
@Cloudera
Cloudera Impala
cloudera.com/impala
“Imagination is more important than
knowledge. For knowledge is limited to all
we now know and understand, while
imagination embraces the entire world, and
all there ever will be to know and
understand.”
~Albert Einstein
Six3 Cyber Security Demo
https://github.com/sherpasurfing
Editor's Notes
Interactive SQL for Hadoop
• Responses in seconds vs. minutes or hours
• 4-100x faster than Hive
• Nearly ANSI-92 standard SQL with HiveQL
• CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.
• ODBC/JDBC drivers
• Compatible SQL interface for existing Hadoop/CDH applications
Native MPP Query Engine
• Purpose-built for low latency queries – another application being brought to Hadoop
• Separate runtime from MapReduce which is designed for batch processing
• Tightly integrated with Hadoop ecosystem – major design imperative and differentiator for Cloudera
• Single system (no integration)
• Native, open file formats that are compatible across the ecosystem (no copying)
• Single metadata model (no synchronization)
• Single set of hardware and system resources (better performance, lower cost)
• Integrated, end-to-end security (no vulnerabilities)
Open Source
• Keeps with our strategy of an open platform – i.e. if it stores or processes data, it’s open source
• Apache-licensed
• Code available on Github
More & Faster Value from Big Data
• Provides an interactive BI/Analytics experience on Hadoop
• Previously BI/Analytics was impractical due to the batch orientation of MapReduce
• Enables more users to gain value from organizational data assets (SQL/BI users)
• Makes more data available for analysis (raw data, multi-structured data, historical data)
• Removes delays from data migration: into specialized analytical DBMSs, into proprietary file formats that happen to be stored in HDFS, and into transient in-memory stores
Flexibility
• Query across existing data in Hadoop: HDFS and HBase
• Access data immediately and directly in its native format
• Select best-fit file formats
• Use raw data formats when unsure of access patterns (text files, RCFiles, LZO)
• Increase performance with optimized file formats when access patterns are known (Parquet, Avro)
• All file formats are compatible across the entire Hadoop ecosystem – i.e. MapReduce, Pig, Hive, Impala, etc. on the same data at the same time
Cost Efficiency
• Reduce movement, duplicate storage & compute
• Data movement: no time or resource penalty for migrating data into specialized systems or formats
• Duplicate storage: no need to duplicate data across systems or within the same system in different file formats
• Compute: use the same compute resources as the rest of the Hadoop system
• You don’t need a separate set of nodes to run interactive query vs. batch processing (MapReduce)
• You don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions
• 10% to 1% of the cost of an analytic DBMS
• Less than $1,000/TB
Full Fidelity Analysis
• No loss of fidelity from aggregations or conforming to fixed schemas
• If the attribute exists in the raw data, you can query against it
This is an overview of the simple cluster I put together for the webinar, 4 nodes in total: a 3-node Hadoop cluster and an application server. The configuration here is one that would be present in many public and private organizations. We have placed a sensor at the gateway (or gateways) across the enterprise to monitor incoming and outgoing traffic. This information is captured by a variety of sensors/collectors and written to files on a regular basis. So now let's go through the data sets.
1.) Provide a brief tour of the cluster using Cloudera Manager