Are you new to Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?
If you're wondering whether Cloudera's Developer Training for Apache Hadoop is right for you and your team, this presentation will help you decide. You will learn who is best suited to attend the live training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Manager, Ian Wrigley, will discuss the skills you will attain during the course and how they will help you become a full-fledged Hadoop application developer.
During the session, Ian will also present a short portion of the actual Cloudera Developer course, discussing the difference between New and Old APIs, why there are different APIs, and which you should use when writing your MapReduce code. Following the presentation, Ian will answer your questions about this or any of Cloudera’s other training courses.
Visit the resources section of cloudera.com to view the on-demand webinar.
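As a companion to the New-versus-Old API discussion above, here is a minimal, hypothetical sketch of a word-count mapper written against the newer org.apache.hadoop.mapreduce API; the class and field names are illustrative, and the older org.apache.hadoop.mapred API would instead implement a Mapper interface and write through an OutputCollector.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: extends the abstract Mapper class and writes through a Context object.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // Context replaces the old API's OutputCollector and Reporter
        }
    }
}
```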
Learn who is best suited to attend the full training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Developer, Jesse Anderson, will discuss the skills you will attain during the course and how they will help you make the most of your HBase deployment in development or production and prepare for the Cloudera Certified Specialist in Apache HBase (CCSHB) exam.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of its open source components have pushed Hadoop to where it is today.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Arun C Murthy, Founder and Architect at Hortonworks Inc., talks about the upcoming Next Generation Apache Hadoop MapReduce framework at the Hadoop Summit, 2011.
Spark-on-Yarn: The Road Ahead (Marcelo Vanzin, Cloudera) - Spark Summit
Spark on YARN provides resource management and security features through YARN, but still has areas for improvement. Dynamic allocation in YARN allows Spark applications to grow and shrink executors based on task demand, though latency and data locality could be enhanced. Security supports Kerberos authentication and delegation tokens, but long-lived applications face token expiration issues and encryption needs improvement for control plane, shuffle files, and user interfaces. Overall, usability, security, and performance remain areas of focus.
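As an illustration of the dynamic allocation behaviour described above, here is a minimal sketch using Spark's standard configuration keys; the application name, executor bounds, and timeout are placeholder values, and the job is assumed to be submitted to a YARN cluster (e.g. via spark-submit --master yarn) with the external shuffle service enabled on the NodeManagers.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DynamicAllocationDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("dynamic-allocation-demo")            // placeholder name
                .set("spark.dynamicAllocation.enabled", "true")
                .set("spark.shuffle.service.enabled", "true")     // keeps shuffle files available when executors are released
                .set("spark.dynamicAllocation.minExecutors", "2")
                .set("spark.dynamicAllocation.maxExecutors", "50")
                .set("spark.dynamicAllocation.executorIdleTimeout", "60s");

        // With these settings, YARN grants and reclaims executors as the number of pending tasks changes.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("dynamic allocation: " + sc.getConf().get("spark.dynamicAllocation.enabled"));
        }
    }
}
```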
a Secure Public Cache for YARN Application Resources - DataWorks Summit
This document discusses YARN's shared cache feature for application resources. It provides an overview of how YARN localizes resources for each application and its containers. The shared cache addresses inefficiencies in this process by caching identical resources on NodeManagers and sharing them between applications and containers. The design goals are for the shared cache to be scalable, secure, fault-tolerant, and transparent. It works by having a shared cache client that interfaces with a shared cache manager, which maintains the metadata and persisted resources. This can significantly reduce data transfer and localization costs for applications that reuse common resources.
The document discusses functional programming concepts and their application to big data problems. It provides an overview of functional programming foundations and languages. Key functional programming concepts discussed include first-class functions, pure functions, recursion, and immutability. These concepts are well-suited for data-centric applications like Hadoop MapReduce. The document also presents a case study comparing an imperative approach to a transaction processing problem to a functional approach, showing that the functional version was faster and avoided side effects.
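To make the contrast concrete, here is a small, hypothetical Java sketch of the same aggregation written imperatively and then functionally with a pure function plus map/reduce; the discount rule and values are invented for illustration, but the functional shape is the one that transfers directly to MapReduce-style processing.

```java
import java.util.List;

public class PureFunctionDemo {
    // A pure function: its result depends only on its input and it has no side effects.
    static double applyDiscount(double amount) {
        return amount > 100.0 ? amount * 0.9 : amount;
    }

    public static void main(String[] args) {
        List<Double> orders = List.of(50.0, 120.0, 300.0);

        // Imperative style: a mutable accumulator updated in a loop.
        double totalImperative = 0.0;
        for (double order : orders) {
            totalImperative += applyDiscount(order);
        }

        // Functional style: map then reduce over immutable data, the same shape as a MapReduce job.
        double totalFunctional = orders.stream()
                .map(PureFunctionDemo::applyDiscount)
                .reduce(0.0, Double::sum);

        System.out.println(totalImperative + " == " + totalFunctional);
    }
}
```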
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN - DataWorks Summit
DeathStar is a system that runs HBase on YARN to provide easy, dynamic multi-tenant HBase clusters via YARN. It allows different applications to run HBase in separate application-specific clusters on a shared HDFS and YARN infrastructure. This provides strict isolation between applications and enables dynamic scaling of clusters as needed. Some key benefits are improved cluster utilization, easier capacity planning and configuration, and the ability to start new clusters on demand without lengthy provisioning times.
Spark is an open-source software framework for rapid calculations on in-memory datasets. It uses Resilient Distributed Datasets (RDDs) that can be recreated if lost and supports transformations and actions on RDDs. Spark is useful for batch, interactive, and real-time processing across various problem domains like SQL, streaming, and machine learning via MLlib.
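Here is a minimal, hypothetical word-count-style sketch of those ideas in Java (Spark 2.x-style API): transformations such as flatMap and filter are lazy, and nothing executes until an action such as count is called.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasicsDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("spark is fast", "spark keeps data in memory"));

            // Transformations are lazy: they only describe the lineage used to rebuild lost partitions.
            JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaRDD<String> sparkWords = words.filter(w -> w.equals("spark"));

            // count() is an action: it triggers execution and returns a result to the driver.
            System.out.println("occurrences of 'spark': " + sparkWords.count());
        }
    }
}
```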
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra... - Edureka!
This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
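As a rough sketch of what that looks like from an application, the snippet below queries a JSON file in place through Drill's JDBC driver; the ZooKeeper address, file path, and field names are placeholders, and the Drill JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a Drill cluster through its ZooKeeper quorum (placeholder address).
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement();
             // Query raw JSON in place; no schema definition or ETL step is required first.
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.user_id, COUNT(*) AS events "
                     + "FROM dfs.`/data/events.json` t GROUP BY t.user_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + " -> " + rs.getLong("events"));
            }
        }
    }
}
```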
Applied Deep Learning with Spark and Deeplearning4j - DataWorks Summit
This document discusses deep learning and DL4J. It begins with an overview of deep learning, describing it as automated feature engineering through chained techniques like restricted Boltzmann machines. It then introduces DL4J, describing it as an enterprise-grade deep learning library for Java, Scala, and Python that supports parallelization on YARN and Spark as well as GPUs. The rest of the document discusses using DL4J with Spark for deep learning workflows on large datasets and provides an example of using the DL4J tool suite to perform vectorization, training, and evaluation on the Iris dataset.
Introduction to Pig | Pig Architecture | Pig Fundamentals - Skillspeed
This Hadoop Pig tutorial will unravel Pig Programming, Pig Commands, Pig Fundamentals, Grunt Mode, Script Mode & Embedded Mode.
At the end, you'll have a strong knowledge regarding Hadoop Pig Basics.
PPT Agenda:
✓ Introduction to BIG Data & Hadoop
✓ What is Pig?
✓ Pig Data Flows
✓ Pig Programming
----------
What is Pig?
Pig is an open source data flow language that expresses data management operations as simple scripts written in Pig Latin. Pig works closely with MapReduce.
----------
Applications of Pig
1. Data Cleansing
2. Data Transfers via HDFS
3. Data Factory Operations
4. Predictive Modelling
5. Business Intelligence
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
This document discusses Hivemall, an open source machine learning library for Apache Hive, Spark, and Pig.
Hivemall is a scalable machine learning library built as a collection of Hive UDFs that allows users to perform machine learning tasks like classification, regression, and recommendation using SQL queries. Hivemall supports many popular machine learning algorithms and can run in parallel on large datasets using Apache Spark, Hive, Pig, and other big data frameworks. The document outlines how to run a machine learning workflow with Hivemall on Spark, including loading data, building a model, and making predictions.
Hadoop clusters are operated on an ephemeral basis in the cloud by Qubole, processing over 300 petabytes of data per month across over 100 customers. Qubole addresses challenges of ephemeral clusters through auto-scaling of resources using YARN, optimizing performance for cloud storage, and storing job history remotely. Volatile low-cost nodes are leveraged through policies that ensure data replication despite potential node failures.
Impala is a massively parallel processing SQL query engine for Hadoop. It allows users to issue SQL queries directly to their data in Apache Hadoop. Impala uses a distributed architecture where queries are executed in parallel across nodes by Impala daemons. It uses a new execution engine written in C++ with runtime code generation for high performance. Impala also supports commonly used Hadoop file formats and can query data stored in HDFS and HBase.
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters - DataWorks Summit
This document discusses the Cloud and Information Services Lab (CISL) and its vision of having one cluster that can run all workloads. It describes CISL's research into improving resource management in shared clusters through projects like Mercury, which aims to improve cluster utilization by opportunistically using otherwise idle resources. The document outlines Mercury's architecture and how it implements a hybrid of centralized and distributed scheduling to better schedule tasks with short durations.
Transitioning Compute Models: Hadoop MapReduce to Spark - Slim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... - Edureka!
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks - DataWorks Summit
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
Performance tuning your Hadoop/Spark clusters to use cloud storage - DataWorks Summit
Remote storage provides the ability to separate compute and storage, which ushers in a new world of infinitely scalable and cost-effective storage. Remote storage in the cloud built to the HDFS standard has unique features that make it a great choice for storing and analyzing petabytes of data at a time. Customers can have unlimited storage capacity without any limit to the number or size of the files. With such scale, superior I/O performance becomes an increasingly important consideration when performing analysis on this data. For all workloads, remote storage in the cloud can provide excellent performance when the various knobs are tuned correctly...
Speaker
Stephen Wu, Senior Program Manager, Microsoft
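The talk above covers cloud-storage tuning in general; as one concrete, hedged illustration, the sketch below sets a few of the S3A connector knobs exposed by Hadoop's hadoop-aws module from Java. The bucket name and values are placeholders rather than recommendations, and other connectors (ADLS, WASB, GCS) have their own analogous settings.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloudStorageTuningDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative starting points for highly parallel jobs reading from object storage.
        conf.set("fs.s3a.connection.maximum", "200");      // allow more concurrent connections
        conf.set("fs.s3a.threads.max", "64");              // larger upload/IO thread pool
        conf.set("fs.s3a.multipart.size", "134217728");    // 128 MB multipart upload chunks

        // Bucket and path are placeholders.
        try (FileSystem fs = FileSystem.get(new URI("s3a://my-analytics-bucket/"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("s3a://my-analytics-bucket/data/"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}
```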
Sanjay Radia presents on evolving HDFS to support a generalized storage subsystem. HDFS currently scales well to large clusters and storage sizes but faces challenges with small files and blocks. The solution is to (1) only keep part of the namespace in memory to scale beyond memory limits and (2) use block containers of 2-16GB to reduce block metadata and improve scaling. This will generalize the storage layer to support containers for multiple use cases beyond HDFS blocks.
Hortonworks Big Data Career Paths and Training - Aengus Rooney
Hortonworks provides training and resources for working with big data and Apache Hadoop. It employs many of the committers to the Apache Hadoop project and influences the project's roadmap. Hortonworks nurtures the open source community through resources like community forums, documentation, and a large partner network. It offers full lifecycle support for customers through subscriptions, consulting, training programs, and certifications.
Hadoop is a distributed processing framework for large datasets. It utilizes HDFS for storage and MapReduce as its programming model. The Hadoop ecosystem has expanded to include many other tools. YARN was developed to address limitations in the original Hadoop architecture. It provides a common platform for various data processing engines like MapReduce, Spark, and Storm. YARN improves scalability, utilization, and supports multiple workloads by decoupling cluster resource management from application logic. It allows different applications to leverage shared Hadoop cluster resources.
This document provides an overview of big data and the Spark framework. It discusses the big data ecosystem, including file systems, data ingestion tools, batch and real-time data processing frameworks, visualization tools, and support technologies. It outlines common big data job roles and their associated skills. The document then focuses on Spark, describing its core functionality, modules like DataFrames and MLlib, and execution modes. It provides guidance on learning Spark, emphasizing programming skills and Spark APIs. A demo of Spark fundamentals on a big data lab is also proposed.
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
End-to-End Deep Learning with Horovod on Apache Spark - Databricks
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
This document provides an overview of Cloudera's "Data Analyst Training: Using Pig, Hive, and Impala with Hadoop" course. The course teaches data analysts how to use Pig, Hive, and Impala for large-scale data analysis on Hadoop. It covers loading and analyzing data with these tools, choosing the best tool for different jobs, and includes hands-on exercises. The target audience is data analysts and others interested in using Pig, Hive and Impala for big data analytics.
Deploying Enterprise-grade Security for Hadoop - Cloudera, Inc.
Deploying enterprise-grade security for Hadoop, or: six security problems with Apache Hive. In this talk we discuss the security problems with Hive and then secure Hive with Apache Sentry. Additional topics include Hadoop security and Role-Based Access Control (RBAC).
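For flavour, here is a minimal, hypothetical sketch of the role-based access control side: Sentry policies for Hive are managed with GRANT/REVOKE statements issued through HiveServer2 (here via JDBC). The host, database, group, and role names are placeholders, and a Sentry-enabled cluster plus the Hive JDBC driver on the classpath are assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SentryRbacDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 endpoint; must be run as a user allowed to administer roles.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2.example.com:10000/default", "admin", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE ROLE analyst_role");
            stmt.execute("GRANT ROLE analyst_role TO GROUP analysts");
            // Grant read-only access at the database level instead of to individual users.
            stmt.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst_role");
        }
    }
}
```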
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX - BMC Software
Learn how CARFAX utilized the power of Control-M to help drive big data processing via Cloudera. See why it was a no-brainer to choose Control-M to help manage workflows through Hadoop, some of the challenges faced, and the benefits the business received by using an existing, enterprise-wide workload management system instead of choosing “yet another tool.”
Introduction to Apache Spark Developer Training - Cloudera, Inc.
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn what Apache Spark is and how it compares to Hadoop MapReduce; how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs); who is best suited to attend the course and what prior knowledge you should have; and the benefits of building Spark applications as part of an enterprise data hub.
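A minimal, hypothetical Java sketch of that filter / map / reduce / save flow is shown below; the input and output paths are placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddPipelineDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-pipeline-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input/access.log");               // placeholder input path

            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")); // keep only error lines
            JavaRDD<Integer> lengths = errors.map(String::length);                 // transform each element
            int totalChars = lengths.reduce(Integer::sum);                         // action: aggregate on the driver

            errors.saveAsTextFile("output/errors");                                // action: write results out
            System.out.println("total characters in error lines: " + totalChars);
        }
    }
}
```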
Introduction to data processing using Hadoop and Pig - Ricardo Varela
In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also discuss the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
Hadoop, Pig, and Twitter (NoSQL East 2009) - Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
Cloudera Impala: A Modern SQL Engine for Hadoop - Cloudera, Inc.
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
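To ground the HDFS side, here is a minimal, hypothetical sketch of a client writing a file and listing a directory through the FileSystem API; the NameNode address and paths are placeholders. The client only talks to the NameNode for metadata, while the data itself is streamed to and replicated across DataNodes in blocks.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode URI

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");
            // The NameNode records the file's metadata; the bytes land in replicated blocks on DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}
```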
The document describes how to use Gawk to perform data aggregation from log files on Hadoop by having Gawk act as both the mapper and reducer to incrementally count user actions and output the results. Specific user actions are matched and counted using operations like incrby and hincrby and the results are grouped by user ID and output to be consumed by another system. Gawk is able to perform the entire MapReduce job internally without requiring Hadoop.
Big-Data Hadoop Training Institutes in Pune | CloudEra Certification courses ... - mindscriptsseo
MindScripts is the best Big-Data Hadoop Training Institute/Center in Pune, providing complete courses including Cloudera, Hortonworks, HDFS, MapReduce, Pig, Hive, Sqoop, and ZooKeeper. The course is designed with the Cloudera certification syllabus in mind.
Datascience Training with Hadoop, Python Machine Learning & Scala, Spark - SequelGate
Hadoop Data Science Training
Microsoft consulted data scientists and the companies that employ them to identify the core skills they need to be successful. This informed the curriculum used to teach key functional and technical skills, combining highly rated online courses with hands-on labs, concluding in a final capstone project.
Build your operator with the right tool - Rafał Leszko
The document discusses different tools that can be used to build Kubernetes operators, including the Operator SDK, Helm, Ansible, Go, and operator frameworks like KOPF. It provides an overview of how each tool can be used to generate the scaffolding and implement the logic for a sample Hazelcast operator.
Build Your Kubernetes Operator with the Right Tool! - Rafał Leszko
The document discusses different tools and frameworks for building Kubernetes operators, including the Operator SDK, Helm, Ansible, Go, KOPF, Java Operator SDK, and using bare programming languages. It provides examples of creating operators using the Operator SDK with Helm, Ansible and Go plugins, and also using the KOPF Python framework. The document highlights the key steps and capabilities of each approach.
Hello everyone,
For this meetup, we have the good fortune of being hosted at Richemont's offices.
Special thanks to Cédric Georg and the Richemont team for their welcome.
For this DevOps meetup, we will have two experience reports; here is the agenda for the evening:
18:30 - Doors open
(you will need to give your first and last name, as well as your license plate number if you came by car; it's for security, and yes, we don't joke around here :-))
18:50 - Introduction by Matthieu and Cédric
19:00 - Richemont and its DevOps transformation
Richemont, in the midst of its digital transformation, had to adapt to get its development and operations teams working together using automation and communication tools.
Squads, DevOps, testing, security, Agile and Scrum: how all these terms became everyday practice at Richemont in just a few years.
We will look at how we put this in place, and at the positive and negative aspects of this transformation.
19:40 - SixSq and Docker automation on edge points (DEMO)
Edge computing is gaining in popularity to address the explosion of data produced by IoT sensors, and the need to better manage AI both in the cloud and at the edge. To address this paradigm shift, SixSq has launched two open source projects: Nuvla for managing applications, and NuvlaBox, a cloud-in-a-box edge solution.
Using these open source projects, in this session we'll demonstrate how edge computing can now be integrated to agnostically operate containerized applications on CaaS infrastructures anywhere, using a Raspberry Pi-based platform.
The UberCloud - From Project to Product - From HPC Experiment to HPC Marketpl... - Wolfgang Gentzsch
The UberCloud online marketplace for engineers and scientists to discover, try, and buy compute power on demand, in the cloud. Starting with free experiments in the cloud, including application software, cloud hardware, and expertise. Learning by doing how to use your application in the cloud.
info.theubercloud.com/case-studies-and-resources
Developer joy for distributed teams with CodeReady Workspaces | DevNation Tec... - Red Hat Developers
Natale Vinto discusses CodeReady Workspaces, a developer environment tool that runs on OpenShift. It provides containerized developer workspaces that enable coding directly within Kubernetes clusters. Key features include the Eclipse Che IDE, compatibility with VSCode extensions, and use of "devfiles" to define and standardize reproducible developer environments. CodeReady Workspaces aims to improve productivity for remote and distributed teams by reducing setup times and enabling self-service access to development environments.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
Todd Lipcon gives a presentation introducing Apache Spark. He begins with an overview of Spark, explaining that it is a general purpose computational framework that improves on MapReduce by leveraging distributed memory for better performance and providing a more developer-friendly API. Lipcon then discusses Spark's Resilient Distributed Datasets (RDDs) and its expressive transformations and actions API. He provides examples of word count programs in Java and Scala. Lipcon also highlights Spark's integration with Hadoop, built-in machine learning library MLlib, and streaming capabilities through Spark Streaming.
Kubernetes is much more than just a container orchestration platform … alongside the Cloud Native Landscape, Kubernetes is the equivalent of the Linux kernel, with an ecosystem of apps and utilities that enrich it.
This document proposes a project called Learn By Doing (LBD) to demonstrate an "Acquisition 2.0" approach to cloud computing procurement. The LBD project would involve standing up a hybrid cloud using open source software to provide infrastructure and platform services. This cloud environment would serve as an "innovation sandbox" and procurement example. The project aims to help agencies better understand cloud types and procurement while providing a working system to develop requirements and contracting documents in collaboration with stakeholders.
GraphQL can be one of the best ways to make your product development more fun and productive. In this presentation I talk about how GraphQL makes your life simpler, and how to write and deploy a GraphQL API with Apollo Server 2.0 and serverless deployment via Netlify Functions.
This document provides an agenda and overview for a webinar on Kubernetes. The agenda includes an introduction to Kabisa, an introduction to Kubernetes concepts, and a hands-on Kubernetes workshop. Kabisa is introduced as a software development agency specialized in custom web and mobile app development with over 14 years of experience. Key Kubernetes concepts are then summarized, including clusters, nodes, pods, namespaces, replica sets, load balancers, and deployments. Finally, the hands-on workshop is outlined which will have participants claim a Kubernetes cluster and complete tasks like creating pods, services, and using deployments, environment variables, secrets, and config maps.
This document discusses Apigility-powered RESTful APIs on IBM i systems. It covers API concepts, installing Apigility, creating RESTful web services, using the Apigility toolkit, and error handling. The presentation discusses installing Apigility locally or remotely, designing URI patterns, using the admin interface to create services, adding database and toolkit services, and calling the toolkit from PHP, CL, and RPG code. It also provides tips on best practices like abstracting toolkit calls and using commands and queries.
Kubernetes has become the defacto standard as a platform for container orchestration. Its ease of extending and many integrations has paved the way for a wide variety of data science and research tooling to be built on top of it.
From all encompassing tools like Kubeflow that make it easy for researchers to build end-to-end Machine Learning pipelines to specific orchestration of analytics engines such as Spark; Kubernetes has made the deployment and management of these things easy. This presentation will showcase some of the larger research tools in the ecosystem and go into how Kubernetes has enabled this easy form of application management.
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX] - Animesh Singh
Kubeflow Pipelines and TensorFlow Extended (TFX) together form an end-to-end platform for deploying production ML pipelines. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system. In this talk we describe how to run TFX in hybrid cloud environments.
K8s in 3h - Kubernetes Fundamentals Training - Piotr Perzyna
Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. This training helps you understand key concepts within 3 hours.
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri... - Srijan Technologies
Drupal has been a consistent leader in the Gartner Magic Quadrant for Web Content Management. However, enterprises leveraging Drupal have traditionally relied on PaaS providers for their hosting, scaling and lifecycle management. And that usually leads to enterprise applications being locked-in with a particular cloud or vendor.
As container and container orchestration technologies disrupt the cloud and platform landscape, there’s a clear way to avoid this state of affairs. In this webinar, we discuss why it's important to build a cloud-native Drupal platform, and exactly how to do that.
Join the webinar to understand how you can avoid vendor lock-in, and create a secure platform to manage, operate and scale your Drupal applications in a multi-cloud portable manner.
Key Takeaways:
- Why you need a cloud-native Drupal platform and how to build one
- How to craft an idiomatic development workflow
- Understanding infrastructure and cloud engineering - under the hood
- Demystifying the art and science of Docker and Kubernetes: deep dive into scaling the LAMP stack
- Exploring cost optimization and cloud governance
- Understand portability of applications
- A hands-on demo of how the platform works
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists - Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists - Cloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 - Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 - Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 - Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 - Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 - Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 - Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 - Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 - Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the Platform - Cloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 - Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 - Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 - Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 - Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Introduction to Hadoop Developer Training Webinar
1. An Introduction to Cloudera’s Hadoop Developer Training Course
Ian Wrigley, Curriculum Manager
2. Welcome to the Webinar!
All lines are muted
Q & A after the presentation
Ask questions at any time by typing them in the WebEx panel
A recording of this Webinar will be available on demand at cloudera.com
3. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
4. Cloudera’s Training is the Industry Standard
Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training
Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop
Cloudera has trained over 15,000 students
5. Cloudera Training: The Benefits
1 Broadest Range of Courses – Cover all the key Hadoop components
2 Most Experienced Instructors – Over 15,000 students trained since 2009
3 Leader in Certification – Over 5,000 accredited Cloudera professionals
4 State of the Art Curriculum – Classes updated regularly as Hadoop evolves
5 Widest Geographic Coverage – Most classes offered: 20 countries plus virtual classroom
6 Most Relevant Platform & Community – CDH deployed more than all other distributions combined
7 Depth of Training Material – Hands-on labs and VMs support live instruction
8 Ongoing Learning – Video tutorials and e-learning complement training
6. The professionalism and expansive technical knowledge of our classroom instructor was incredible. The quality of the training was on par with a university.
7. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
8. Common Attendee Profiles
Software Developers/Engineers
Business analysts
IT managers
Hadoop system administrators
9. Course Prerequisites
Programming experience
Knowledge of Java highly recommended
Understanding of common computer science principles is helpful
Prior knowledge of Hadoop is not required
10. Who Should Not Attend?
If you have no programming experience, you’re likely to find the course very difficult
You might consider our Hive and Pig training course instead
If you will be focused solely on configuring and managing your cluster, our Administrator training course would probably be a better alternative
11. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
12. Developer Training: Overview
The course assumes no pre-existing knowledge of Hadoop
Starts by discussing the motivation for Hadoop
What problems exist that are difficult (or impossible) to solve with existing systems
Explains basic Hadoop concepts
The Hadoop Distributed File System (HDFS)
MapReduce
Introduces the Hadoop API (Application Programming Interface)
13. Developer Training: Overview (cont’d)
Moves on to discuss more complex Hadoop concepts
Custom Partitioners
Custom Writables and WritableComparables
Custom InputFormats and OutputFormats
Investigates common MapReduce algorithms
Sorting, searching, indexing, joining data sets, etc.
Then covers the Hadoop ‘ecosystem’
Hive, Pig, Sqoop, Flume, Mahout, Oozie
15. Hands-On Exercises
The course features many Hands-On Exercises
Analyzing log files
Unit-testing Hadoop code
Writing and implementing Combiners
Writing custom Partitioners
Using SequenceFiles and file compression
Creating an inverted index
Creating custom WritableComparables
Importing data with Sqoop
Writing Hive queries
…and more
16. Certification
Our Developer course is good preparation for the Cloudera Certified Developer for Apache Hadoop (CCDH) exam
A voucher for one attempt at the exam is currently included in the course fee
17. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
23. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
30. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
31. Conclusion
Cloudera’s Developer training course is:
Technical
Hands-on
Interactive
Comprehensive
Attendees leave the course with the skillset required to write, test, and run Hadoop jobs
The course is good preparation for the CCDH certification exam
32. Questions?
For more information on Cloudera’s training courses, or to book a place on an upcoming course: http://university.cloudera.com
My e-mail address: ian@cloudera.com
Feel free to ask questions!
Hit the Q&A button, and type away
Editor's Notes
This topic is discussed in further detail in TDG 3e on pages 27-30 (TDG 2e, 25-27). NOTE: The New API / Old API is completely unrelated to MRv1 (MapReduce in CDH3 and earlier) / MRv2 (next-generation MapReduce, also called YARN, which will be available along with MRv1 starting in CDH4). Instructors are advised to avoid confusion by not mentioning MRv2 during this section of class, and if asked about it, to simply say that it’s unrelated to the old/new API and defer further discussion until later.
On this slide, you should point out the similarities as well as the differences between the two APIs. You should emphasize that they are both doing the same thing and that there are just a few differences in how they go about it.
You can tell whether a class belongs to the “Old API” or the “New API” based on the package name. The old API contains “mapred” while the new API contains “mapreduce” instead. This is the most important thing to keep in mind, because some classes/interfaces have the same name in both APIs. Consequently, when you are writing your import statements (or generating them with the IDE), you will want to be cautious and use the one that corresponds to whichever API you are using to write your code.
The functions of the OutputCollector and Reporter objects have been consolidated into a single Context object. For this reason, the new API is sometimes called the “Context Objects” API (TDG 3e, page 27 or TDG 2e, page 25).
NOTE: The “Keytype” and “Valuetype” shown in the map method signature aren’t actual classes defined in the Hadoop API. They are just placeholders for whatever types you use for key and value (e.g. IntWritable and Text). Also, the generics for the keys and values are not shown in the class definition for the sake of brevity, but they are used in the new API just as they are in the old API.
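For reference, here is a minimal sketch of the same word-count-style Mapper written against each API, as two separate source files. The class names and the simple whitespace tokenization are illustrative assumptions, not code from the course materials.

// Old API (org.apache.hadoop.mapred): Mapper is an interface; output goes to an
// OutputCollector and progress/status to a Reporter.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Emit (word, 1) for every whitespace-separated token in the input line.
    for (String word : value.toString().split("\\s+")) {
      output.collect(new Text(word), new IntWritable(1));
    }
  }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class, and the roles of
// OutputCollector and Reporter are consolidated into a single Context object.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiWordMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Same logic as above, but output goes through the Context object.
    for (String word : value.toString().split("\\s+")) {
      context.write(new Text(word), new IntWritable(1));
    }
  }
}

Note how the import statements (mapred vs. mapreduce) are the quickest way to confirm which API a class belongs to.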
An example of maintaining sorted order globally across all reducers was given earlier in the course when Partitioners were introduced.
NOTE: worker nodes are configured to reserve a portion (typically 20% - 30%) of their available disk space for storing intermediate data. If too many Mappers are feeding into too few reducers, you can produce more data than the reducer(s) could store. That’s a problem.
At any rate, having all your mappers feeding into a single reducer (or just a few reducers) isn’t spreading the work efficiently across the cluster.
Use of the TotalOrderPartitioner is described in detail on pages 274-277 of TDG 3e (TDG 2e, 237-241). It is essentially based on sampling your keyspace so you can divide it up efficiently among several reducers, based on the global sort order of those keys.
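A minimal driver fragment along these lines might look like the sketch below. The class name, partition-file path, reducer count, and sampling parameters are assumptions for illustration (not values from the course or from TDG), and it assumes a job whose input and map output keys are both Text so the sampled input keys match the partitioner’s key type.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSetup {
  // Configure an already-created Job (input/output formats and Text keys set
  // elsewhere) to produce globally sorted output across several reducers.
  public static void configure(Job job) throws Exception {
    job.setNumReduceTasks(4);                              // illustrative reducer count
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // The partition file stores the sampled key-range boundaries.
    Path partitionFile = new Path("/tmp/partitions");      // assumed path
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

    // Sample roughly 1% of the keys (up to 1,000 samples from at most 10 splits)
    // so the boundaries spread keys evenly across the reducers.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.01, 1000, 10);
    InputSampler.writePartitionFile(job, sampler);
  }
}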
But beware that this can be a naïve approach. If sales data is processed this way, business-to-business operations (like plumbing supply warehouses) would likely have little or no data for the weekend, since they will likely be closed. Conversely, a retail store in a shopping mall will likely have far more data for a Saturday than a Tuesday.
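As a hypothetical illustration of that naïve scheme, a custom Partitioner that routes records to one of seven reducers by day of week could look like this (the class name and the assumption that the map output key is a Text such as "MONDAY" are mine, not course code); the skew described above is exactly what such a partitioner produces.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DayOfWeekPartitioner extends Partitioner<Text, IntWritable> {
  private static final String[] DAYS = {
    "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"
  };

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // One reducer per day when the job is configured with 7 reducers.
    for (int i = 0; i < DAYS.length; i++) {
      if (DAYS[i].equalsIgnoreCase(key.toString())) {
        return i % numPartitions;
      }
    }
    return 0;  // unknown keys fall back to the first reducer
  }
}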
The upper bound on the number of reducers is based on your cluster (machines are configured to have a certain number of “reduce slots” based on the CPU, RAM and other performance characteristics of the machine). The general advice is to choose something a bit less than the max number of reduce slots to allow for speculative execution.
One factor in determining the reducer count is the reduce capacity the developer has access to (or the number of "reduce slots" in either the cluster or the user's pool). One technique is to make the reducer count a multiple of this capacity. If the developer has access to N slots but picks N+1 reducers, the reduce phase will go into a second "wave", and that one extra reducer can potentially double the execution time of the reduce phase. However, if the developer chooses 2N or 3N reducers, each wave takes less time; because there are more waves, you don't see a big degradation in job performance if you need an extra wave due to an extra reducer, a failed task, etc.
Suggestion: draw a picture on the whiteboard that shows reducers running in waves, showing cluster slot count, reducer execution times, etc., to tie together the explanation of performance issues as they have been explained in the last few slides:
A single reducer will run very slowly on an entire data set.
Setting the number of reducers to the available slot count can maximize parallelism in one reducer wave. However, if you have a failure, the reduce phase of the job runs into a second wave, and that will double the execution time of the reduce phase.
Setting the number of reducers to a high number will mean many waves of shorter-running reducers. This scales nicely because you don't have to be aware of the cluster size and you don't pay the cost of a second wave, but it might be more inefficient for some jobs.
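A minimal driver sketch of that technique follows; the class name, job name, and the figure of 20 reduce slots are illustrative assumptions, not values from the course.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "reducer count example");

    // Assumed capacity; check the reduce-slot count of your own cluster or pool.
    int reduceSlots = 20;
    // A multiple of the slot count gives complete waves of shorter reducers
    // instead of one long wave plus a straggler.
    job.setNumReduceTasks(2 * reduceSlots);

    // ... set mapper, reducer, input/output paths as usual, then submit:
    // System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}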