This document provides best practices for optimizing the performance of InfoSphere BigInsights and InfoSphere Streams when deployed in the cloud. It discusses optimizing disk performance by choosing cloud providers and instances with good disk I/O, partitioning and formatting disks correctly, and configuring HDFS to use multiple data directories. It also discusses optimizing Java performance by correctly configuring JVM memory and optimizing MapReduce performance by setting appropriate values for map and reduce tasks based on machine resources.
2. Please Note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
3. Agenda
Introduction
Optimizing for disk performance
Optimizing Java for computational performance
Optimizing MapReduce for computational performance
Optimizing with Adaptive MapReduce
Common considerations for InfoSphere BigInsights and InfoSphere Streams
Questions and Answers
4. Prerequisites
To get the most out of this session, you should be familiar with the basics of the following:
− Hadoop and Streams
− MapReduce
− HDFS or GPFS
− Linux shell
− XML
5. My Team
IBM Information Management Cloud Computing Centre of Competence
− Information Management Demo Cloud
− Deploy complete stacks of IBM software for demonstration and evaluation purposes
− imcloud@ca.ibm.com
Images and templates with IBM software for public clouds
− IBM SmartCloud Enterprise
− IBM SoftLayer
− Amazon EC2
6. My Work
Development:
− Ruby on Rails, Python, Bash/KSH shell scripting, Java
IBM SmartCloud Enterprise
− Public cloud
− InfoSphere BigInsights, InfoSphere Streams, DB2
RightScale and Amazon EC2
− Public cloud
− InfoSphere BigInsights, InfoSphere Streams, DB2
IBM PureApplication System
− Private cloud appliance
− DB2
7. Background
BigInsights recommendations are based on my experience optimizing BigInsights Enterprise 2.1 performance on an OpenStack private cloud
Streams recommendations are based on my experience optimizing Streams 3.1 performance on IBM SmartCloud Enterprise
Some recommendations are based on work with the IBM Social Media Accelerator to process enormous amounts of Twitter data using BigInsights and Streams
8. Hadoop Challenges in the Cloud
Hadoop does batch processing of data stored on disk.
The bottleneck is disk I/O.
Infrastructure-as-a-Service clouds have traditionally focused on uses such as web servers that are optimized for in-memory operation and have different constraints.
10. Disk Performance
Hadoop performance is I/O bound. It depends on disk
performance.
Hadoop is for batch processing of data stored on disks
Contrast with real-time and in-memory workloads (Streams,
Apache), which depend on memory and processor speed
Infrastructure-as-a-Service clouds (IaaS) were originally
optimized for in-memory workloads, not disk workloads
Cloud disk performance has traditionally been weak due to
virtualization abstraction and network separation between
computational units and storage
Different clouds have different solutions to this
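Before committing to a provider or instance type, it is worth measuring raw
sequential disk throughput yourself. The sketch below is a rough check, assuming
the candidate data disk is already formatted and mounted at /mnt/test (a
placeholder path); dd numbers are only indicative, but they quickly expose weak
network-backed storage.
    # Minimal sequential I/O check, assuming a scratch mount at /mnt/test (placeholder)
    dd if=/dev/zero of=/mnt/test/ddtest bs=1M count=1024 oflag=direct   # sequential write
    dd if=/mnt/test/ddtest of=/dev/null bs=1M iflag=direct              # sequential read
    rm -f /mnt/test/ddtest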
11. Disk Performance – Choice of Cloud
Choice of cloud provider and instance type is crucial
Some cloud providers are worse for Hadoop than others
Favour local storage over network-attached storage (NAS)
− For example, EBS on Amazon tends to be slower than local
storage
Options
− SoftLayer and other clouds that offer physical (bare-metal) hardware
− Storage-optimized instances on Amazon EC2
− Other public and private clouds that keep storage as close to
the computational nodes as possible
12. Disk performance – Concepts
Hadoop Distributed File System (HDFS) and General Parallel
File System (GPFS) are both abstractions
HDFS and GPFS run on top of disk filesystems
A disk is a device
A disk is divided into partitions
Partitions are formatted with filesystems
Formatted partitions can be mounted as a directory and used
to store anything
For Hadoop, we want Just-a-Bunch-Of-Disks (JBOD), not
RAID. HDFS has built-in redundancy.
Eschew the Linux Logical Volume Manager (LVM); it adds a layer of
indirection, and HDFS already spreads data across the individual disks.
13. Disk performance – Partitioning
We’ll use /dev/sdb as a sample disk name
Disks greater than 2TB in size require the use of a GUID
Partition Table (GPT) instead of Master Boot Record (MBR)
− parted -s /dev/sdb mklabel gpt
For Hadoop storage, create a single partition per disk
The partition editor can be finicky about where that partition starts
and stops
− end=$( parted /dev/sdb print free -m | grep sdb | cut -d: -f2 )
− parted -s /dev/sdb mkpart logical 1 $end
If you were working with disk /dev/sdb, you will now have a
partition called /dev/sdb1
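If the instance has several data disks, the same commands can simply be repeated
per disk. A minimal sketch, assuming /dev/sdb and /dev/vdc are the empty data
disks (device names vary by cloud and hypervisor):
    for disk in /dev/sdb /dev/vdc; do
      parted -s "$disk" mklabel gpt
      end=$( parted "$disk" print free -m | grep "${disk##*/}" | cut -d: -f2 )
      parted -s "$disk" mkpart logical 1 "$end"
    done
    parted -s /dev/sdb print    # verify: one data partition per disk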
14. Disk performance – Formatting
Many options: ext4, ext3, xfs
xfs is not included in base Red Hat Enterprise Linux (RHEL),
so assume ext4
− mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/sdb1
“-m 1” reduces the percentage of filesystem blocks reserved for
root to 1%. Hadoop does not run as root.
“dir_index” makes listing files in a directory faster. Instead of
using a linked list, the filesystem will use a hashed B-tree.
“extent” makes the filesystem faster when working with large
files. HDFS divides data into blocks of 64MB or more, so
you’ll have many large files.
“sparse_super” saves space on large filesystems by keeping
fewer backups of superblocks. Big Data processing implies
large filesystems.
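The same formatting options can be applied to every data partition. A minimal
sketch, assuming the /dev/sdb1 and /dev/vdc1 partitions created on the previous
slide:
    for part in /dev/sdb1 /dev/vdc1; do
      mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super "$part"
    done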
15. Disk performance – Mounting
Before you can access a partition, you have to mount it in an
empty directory
− mkdir -p /disks/sdb1
− mount -o noatime,nodiratime /dev/sdb1 /disks/sdb1
“noatime” skips writing file access time to disk every time a
file is accessed
“nodiratime” does the same for directories
In order for the system to re-mount your partition after reboot,
you also have to add it to the /etc/fstab configuration file
− echo "/dev/sdb1 /disks/sdb1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
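Putting the mount step together for several data partitions, as a hedged sketch
(again assuming sdb1 and vdc1; substitute your own device names):
    for part in sdb1 vdc1; do
      mkdir -p /disks/$part
      mount -o noatime,nodiratime /dev/$part /disks/$part
      echo "/dev/$part /disks/$part ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
    done
    mount | grep /disks    # confirm the partitions are mounted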
16. HDFS Data Storage on Multiple Partitions
Don’t forget that you can spread HDFS across multiple
partitions (and so disks) on a single system
In the cloud, the root partition / is usually very small. You
definitely don’t want to store Big Data on it.
Don’t use the root of a mounted filesystem (e.g. /disks/sdb1)
as the data path. Create a subdirectory (e.g.
/disks/sdb1/data)
− mkdir -p /disks/sdb1/data
Otherwise, HDFS will get confused by things Linux puts in
the root (e.g. /disks/sdb1/lost+found)
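A short sketch of preparing the data subdirectories on every mounted data disk;
biadmin is the usual BigInsights administration user but is an assumption here,
so substitute whichever user owns your DataNode processes:
    for part in sdb1 vdc1; do
      mkdir -p /disks/$part/data
      chown -R biadmin:biadmin /disks/$part/data   # assumed user/group; adjust as needed
    done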
17. HDFS Data Storage – Installation and Timing
You can set HDFS data storage path during installation or
after installation.
BigInsights has a fantastic installer for Hadoop – it offers both
a web-based graphical installer and a powerful silent install
driven by a response file.
The web-based graphical installer will generate a silent install
response file for you for future automation.
BigInsights also comes with sample silent install response
files.
18. HDFS Data Storage – During installation
During installation, the HDFS data storage path is controlled by
the values of <hdfs-data-directory /> and <data-directory />
For example:
<cluster-configuration>
  <hadoop><datanode><data-directory>
    /disks/sdb1/data,/disks/vdc1/data
  </data-directory></datanode></hadoop>
  <node-list><node><hdfs-data-directory>
    /disks/sdb1/data,/disks/vdc1/data
  </hdfs-data-directory></node></node-list>
</cluster-configuration>
19. HDFS Data Storage – During Installation (2)
Multiple paths are separated by commas
Any path with an omitted initial / is considered relative to the
installation’s <directory-prefix />
If <directory-prefix/> is “/mnt”, then the <hdfs-data-directory/>
“hadoop/data” would be interpreted as “/mnt/hadoop/data”
You can mix relative and absolute paths in the comma-separated list of directories
20. HDFS Data Storage – After Installation
You can change the path of HDFS data storage after
installation
Path is controlled by dfs.data.dir variable in hdfs-site.xml
In Hadoop 2.0, dfs.data.dir is renamed to
dfs.datanode.data.dir
Note: With BigInsights, never modify configuration files in
$BIGINSIGHTS_HOME/hadoop-conf/ directly
− Modify $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/hdfs-site.xml
− Then run syncconf.sh to apply the configuration setting across
the cluster
echo 'y' | syncconf.sh hadoop force
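The overall workflow looks roughly like the following sketch; the vi step stands
in for whatever editor or scripted change you use to set dfs.data.dir in the
staging copy:
    cd $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging
    cp hdfs-site.xml hdfs-site.xml.bak        # keep a backup of the staging copy
    vi hdfs-site.xml                          # set dfs.data.dir to the new comma-separated paths
    grep -A1 dfs.data.dir hdfs-site.xml       # confirm the value before pushing it out
    echo 'y' | syncconf.sh hadoop force       # apply the change across the cluster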
Note: Never reformat data nodes in BigInsights. Reformatting
will erase BigInsights libraries from HDFS.
21. HDFS Namenode Storage
The Namenode of a Hadoop cluster stores the locations of all
the files on the cluster
During installation, the path of this storage is determined by
the value of <name-directory />
After installation, the path of namenode storage is
determined by the value of the dfs.name.dir variable in hdfs-site.xml
You can separate multiple locations with commas
In Hadoop 2.0, dfs.name.dir is renamed to
dfs.namenode.name.dir
23. Java and Computational Performance
BigInsights and Hadoop are Java-based
Configuring the Java Virtual Machine (JVM) correctly is
crucial to processing Big Data in Hadoop
Correct JVM configuration depends on both the machine
and the type of data
BigInsights has a configuration preprocessor that will easily
size the configuration to match the machine
24. Java and Computational Performance
Note: Never modify mapred-site.xml in
$BIGINSIGHTS_HOME/hadoop-conf/ directly
Modify mapred-site.xml in
$BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/
Run syncconf.sh to process the calculations and apply the
new configuration to the cluster
25. Java and Computational Performance
A key property for performance is the amount of memory
allocated to each Java process or task
Keep in mind many tasks will be running at the same time,
and you’ll want them all to fit within available machine
memory with some margin
A good value for many use cases is 600m
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx600m</value>
</property>
When working with the IBM Social Media Accelerator, you’ll
want much more memory per task. 4096m or more is
common, with implications for size of machine expected.
Note: Do not enable -Xshareclasses. This was a bad default
in older BigInsights releases.
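A quick back-of-the-envelope check (an illustrative sketch, using the per-machine
task counts from slide 30) shows why 600m works for commodity instances while
4096m forces bigger machines or fewer tasks:
    xmx=600; maps=16; reduces=8                  # 16-core, 61GB machine from slide 30
    echo $(( (maps + reduces) * xmx ))           # 14400 MB of task heap: fits with ample margin
    xmx=4096
    echo $(( (maps + reduces) * xmx ))           # 98304 MB: does not fit; cut task counts or scale up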
26. Java and Computational Performance –
Streams
Streams and Streams Studio are Java applications
You can increase the amount of memory allocated to the
Streams Web Server (SWS) as follows, where X is in
megabytes:
− streamtool stopinstance --instance-id myinstance
− streamtool setproperty --instance-id myinstance SWS.jvmMaximumSize=X
− streamtool startinstance --instance-id myinstance
You can increase the amount of memory for Streams Studio
in <install-directory>/StreamsStudio/streamsStudio.ini
− After -vmargs, add -Xmx1024m or similar
27. MapReduce and Computational Performance
Hadoop traditionally uses the MapReduce algorithm for
processing Big Data in parallel on a cluster of machines
Each machine runs a certain number of Mappers and
Reducers
A Hadoop Mapper is a task that splits input data into
intermediate key-value pairs
A Hadoop Reducer is a task that reduces a set of
intermediate key-value pairs with a shared key to a smaller
set of values
28. MapReduce and Computational Performance
You’ll want more than one reduce task per machine, with
both the number of available cores and the amount of
available memory constraining the number you can have
The 600 denominator in the per-machine formulas on the next slide
comes from the value for JVM memory in mapred.child.java.opts
<property>
  <name>mapred.reduce.tasks</name>
  <value><%= Math.ceil(numOfTaskTrackers * avgNumOfCores * 0.5 * 0.9) %></value>
</property>
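As a worked example of the expression above, assuming a cluster of 5 task
trackers with 4 cores each (hypothetical numbers):
    echo "5 * 4 * 0.5 * 0.9" | bc    # 9.0 -> mapred.reduce.tasks = 9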
29. MapReduce and Computational Performance
Map tasks and reduce tasks use the machine differently. Map
tasks will fetch input locally, while reduce tasks will fetch
input from the network. They will run at the same time.
Running more tasks than will fit in a machine’s memory will
cause tasks to fail.
Set the number of map tasks per machine to at most the number
of available processor cores, capped by available memory
<name>tasktracker.map.tasks.maximum</name>
<value><%= Math.min(Math.ceil(numOfCores * 1.0),
  Math.ceil(0.8*0.66*totalMem/600)) %></value>
Set the number of reduce tasks per machine to half the
number of map tasks
<name>tasktracker.reduce.tasks.maximum</name>
<value><%= Math.min(Math.ceil(numOfCores * 0.5),
  Math.ceil(0.8*0.33*totalMem/600)) %></value>
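Working the two expressions through for a 4-core machine with roughly 15GB
(15360MB) of memory reproduces the 4 mappers / 2 reducers row in the table on
the next slide:
    echo "0.8*0.66*15360/600" | bc -l   # 13.52 -> ceil = 14; min(ceil(4*1.0), 14) = 4 map tasks
    echo "0.8*0.33*15360/600" | bc -l   # 6.76  -> ceil = 7;  min(ceil(4*0.5), 7)  = 2 reduce tasks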
30. MapReduce and Computational Performance
Cloud machine size    Number of mappers    Number of reducers
1 core, 2GB           1                    1
1 core, 4GB           1                    1
2 cores, 8GB          2                    1
4 cores, 15GB         4                    2
16 cores, 61GB        16                   8
16 cores, 117GB       16                   8
31. More options in mapred-site.xml
“mapred.child.ulimit” lets you control the virtual memory used by
Hadoop’s Java processes. 1.5x the size of mapred.child.java.opts is a
good value. Note that the value is in kilobytes. If the
Java options are “-Xmx600m”, then a good value for the
ulimit is 600*1.5*1024, which is “921600”.
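The same 1.5x rule applies if you raised the task heap, for example to the 4096m
suggested for the Social Media Accelerator on slide 25 (the figure below is just
that arithmetic, not an official recommendation):
    echo $(( 4096 * 1024 * 3 / 2 ))    # 6291456 KB for mapred.child.ulimit with -Xmx4096m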
“io.sort.mb” controls the size of the output buffer for map
tasks. When the buffer is 80% full, it starts being written to disk.
Increasing the buffer size uses more memory but reduces the
number of separate writes to disk.
“io.sort.factor” defines the number of files that can be merged
at one time. Merging is done when a map task is complete,
and again before reducers start executing your analytic code.
Increasing this value uses more memory but does less disk I/O.
32. More options in mapred-site.xml (2)
“mapred.compress.map.output” enables compression when
writing the output of map tasks. Compression uses more
processor capacity but reduces disk I/O. The compression
algorithm is determined by
“mapred.map.output.compression.codec”
“mapred.job.tracker.handler.count” determines the size of the
thread pool for responding to network requests from clients
and tasktrackers. A good value is the natural logarithm (ln) of
cluster size times 20. “dfs.namenode.handler.count” should
also be set to this, as it performs the same functions for
HDFS.
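For example, on a hypothetical 64-node cluster the rule of thumb gives roughly
83 handler threads:
    echo "l(64) * 20" | bc -l    # 83.17 -> set mapred.job.tracker.handler.count (and
                                 # dfs.namenode.handler.count) to about 80-85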
“mapred.jobtracker.taskScheduler” determines the algorithm
used for assigning tasks to task trackers. For production,
you’ll want something more sophisticated than the default
JobQueueTaskScheduler.
33. Kernel Configuration
Linux kernel configuration is stored in /etc/sysctl.conf
“vm.swappiness” controls the kernel’s swapping of data from
memory to disk. You’ll want to discourage swapping to disk,
so 0 is a good value.
“vm.overcommit_memory” allows more memory to be
allocated than exists on the system. If you experience
memory shortages, you may want to set this to 1, because the
way the JVM spawns Hadoop processes causes them to request
more memory than they actually use. Further tuning is done through
“vm.overcommit_ratio”.
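A minimal sketch of applying both settings immediately and persisting them
across reboots (set overcommit_memory only if you actually hit memory shortages,
as noted above):
    sysctl -w vm.swappiness=0
    sysctl -w vm.overcommit_memory=1
    echo "vm.swappiness = 0" >> /etc/sysctl.conf
    echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf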
35. IBM Big Data Platform
[Architecture chart: IBM InfoSphere BigInsights within the IBM Big Data Platform.
The chart groups components into layers – Visualization & Discovery, Applications &
Development, Administration, Advanced Analytic Engines, Workload Optimization,
Runtime / Scheduler, Data Store, File System, and Integration – and colour-codes
each component as open source, IBM, or optional. Components shown include
BigSheets, Text Analytics, Pig & Jaql, Hive, MapReduce, HCatalog, ZooKeeper, Oozie,
Lucene, HBase, Flume, Sqoop, HDFS, GPFS FPO, Adaptive MapReduce, Platform
Symphony, and integration points such as JDBC, DB2, Netezza, DataStage, Streams,
Cognos, and Guardium.]
36. Adaptive MapReduce
Adaptive MapReduce lets mappers communicate through a
distributed metadata store and take into account the global
state of the job
Open the install.properties file before you install BigInsights
To enable Adaptive MapReduce, set the following:
− AdaptiveMR.Enable=true
To also enable High Availability, set the following:
− AdaptiveMR.HA.Enable=true
High Availability requires a minimum number of nodes in your cluster
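A hedged sketch of flipping the two switches before running the installer; it
assumes the stock install.properties already contains these keys, so append them
instead if they are missing:
    sed -i 's/^AdaptiveMR.Enable=.*/AdaptiveMR.Enable=true/' install.properties
    sed -i 's/^AdaptiveMR.HA.Enable=.*/AdaptiveMR.HA.Enable=true/' install.properties
    grep AdaptiveMR install.properties    # confirm both properties are set to true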
Adaptive MapReduce is a single-tenant implementation of
IBM Platform Symphony
38. Common Considerations
Both BigInsights and Streams rely on working with large
numbers of open files and running processes
Raise the Linux limit on the number of open files (“nofile”) to
131072 or more in /etc/security/limits.conf
Raise the Linux limit on the number of processes (“nproc”) to
unlimited in /etc/security/limits.conf
Remove RHEL forkbomb protection from
/etc/security/limits.d/90-nproc.conf
Validate your changes with a fresh login as your BigInsights
and Streams users (e.g. biadmin, streamsadmin) and the
ulimit command
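A minimal sketch of the limits and the validation step, using the example biadmin
and streamsadmin users:
    echo "biadmin       - nofile 131072"    >> /etc/security/limits.conf
    echo "biadmin       - nproc  unlimited" >> /etc/security/limits.conf
    echo "streamsadmin  - nofile 131072"    >> /etc/security/limits.conf
    echo "streamsadmin  - nproc  unlimited" >> /etc/security/limits.conf
    su - biadmin -c 'ulimit -n -u'          # verify from a fresh login shell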
41. Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social
networks, and more
o Find the community that interests you …
• Information Management bit.ly/InfoMgmtCommunity
• Business Analytics bit.ly/AnalyticsCommunity
• Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
o Recognizing individuals who have made the most outstanding contributions to
Information Management, Business Analytics, and Enterprise Content
Management communities
• ibm.com/champion
42. Thank You
Your feedback is important!
• Access the Conference Agenda Builder to
complete your session surveys
o Any web or mobile browser at
http://iod13surveys.com/surveys.html
o Any Agenda Builder kiosk onsite
Editor's Notes
On this chart, you can get a quick overview of the various open source and IBM technologies provided with BigInsights Enterprise Edition. Open source technologies are shown in yellow, while IBM-specific technologies are shown in blue.