Deep dive into HDFS Tiering with Dell EMC Isilon for Hadoop/Big Data. Covers MapReduce, Hive, and Spark use cases. Also includes TPCDS performance comparisons between Direct Attached Storage and Isilon Scale-out NAS Gen 5 and Gen 6 models.
The slides are created for the "Hadoop User Group Vienna", a Meetup that gathers Hadoop users in Vienna on September 6, 2017. The content of the slides correspond to the first talk, which discussed the concepts, terminology and disaster recovery capabilities in the Hadoop ecosystem.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli... (lucenerevolution)
Presented by M.C. Srivas | MapR. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data – Search, Discovery and Analytics need to be integrated. While creating and maintaining separate SOLR and Hadoop clusters is time consuming, error prone and difficult to keep in sync, most Hadoop installations do not integrate with SOLR within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data Search, including how to: protect against silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and GFS, BigTable and MapReduce were used extensively.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
(EMC World 2012): Apache Hadoop is now enterprise ready. This session reviews the features/roadmap of Hadoop. We will review some of the key capabilities of GPHD 1.x and our plans for 2012.
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. In this video, you will learn what is Hadoop, components of Hadoop, what is HDFS, HDFS architecture, Hadoop MapReduce, Hadoop MapReduce example, Hadoop YARN and finally, a demo on MapReduce. Apache Hadoop offers a versatile, adaptable and reliable distributed computing big data framework for a group of systems with capacity limit and local computing power. After watching this video, you will also understand the Hadoop Distributed File System and its features along with the practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
In this webinar, we'll:
-Examine the key drivers and use cases for High Availability, performance and scalability for Apache Hadoop.
-Walk through an overview of reference architecture for a Non-Stop Hadoop implementation.
-Show how you can get started with Non-Stop Hadoop with the Hortonworks Data Platform.
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop... (Simplilearn)
This presentation about Hadoop for beginners will help you understand what is Hadoop, why Hadoop, what is Hadoop HDFS, Hadoop MapReduce, Hadoop YARN, a use case of Hadoop and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is the resource management unit of Hadoop. In this video, we will look into these units individually and also see a demo on each of these units.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Supporting Financial Services with a More Flexible Approach to Big Data (WANdisco Plc)
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
MapR-DB is an enterprise-grade, high performance, in-Hadoop NoSQL (“Not Only SQL”) database management system. It is used to add real-time, operational analytics capabilities to Hadoop and now natively support JSON.
EMC Isilon Best Practices for Hadoop Data Storage (EMC)
This paper describes the best practices for setting up and managing the HDFS service on an EMC Isilon cluster to optimize data storage for Hadoop analytics. This paper covers OneFS 7.2 or later.
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo... (Simplilearn)
This presentation about Hadoop training will help you understand the need for Hadoop, what is Hadoop and concepts including Hadoop ecosystem, Hadoop features, how HDFS works, what is MapReduce and how YARN works. Finally, we will implement a banking case study using Hadoop. To solve the issue of rapidly increasing data, we need big data technologies such as Hadoop, Spark, Storm, Cassandra and many more. Hadoop can store and process vast volumes of data. You will understand the architecture of HDFS, MapReduce workflow and the architecture of YARN. In the demo, you will learn in detail on how to export data from RDBMS (MySQL) into HDFS using Sqoop commands. Now, let us get started and gain expertise with Hadoop training video.
Below topics are explained in this Hadoop training presentation:
1. Need for Hadoop
2. What is Hadoop
3. Hadoop ecosystem
4. Hadoop features
5. What is HDFS
6. What is MapReduce
7. What is YARN
8. Bank case study
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
EMC Isilon Best Practices for Hadoop Data Storage (EMC)
This white paper describes the best practices for setting up and managing the HDFS service on an Isilon cluster to optimize data storage for Hadoop analytics.
Presentation regarding big data. The presentation also contains basics regarding Hadoop and Hadoop components along with their architecture. Contents of the PPT are
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8. Hadoop Security Management Tools: Knox, Ranger
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop (Yahoo Developer Network)
This talk will cover utilizing native Hadoop storage policies and types to effectively archive and tier data in your existing Hadoop infrastructure. Key focus areas are:
1. Why use heterogeneous storage (tiering)?
2. Identifying key metrics for successful archiving of Hadoop data
3. Automation requirements at scale
4. Current limitations and gotchas
The impact of successful archive provides Hadoop users better performance, lower hardware cost, and lower software costs. This session will cover the techniques and tools available to unlock this powerful capability in native Hadoop.
Speakers:
Peter Kisich works with multiple large scale Hadoop customers successfully tiering and optimizing Hadoop infrastructure. He co-founded FactorData to bring enterprise storage features and control to open Hadoop environments. Previously, Mr. Kisich served as a global subject matter expert in Big Data and Cloud computing for IBM including speaking at several global conferences and events.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: Solve big-data problems with Hadoop. Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data. Implement Hadoop using HDS Hadoop reference architecture. For more information on Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
An overview of Big Data and Hadoop, the architecture it uses, and the way it works on data sets. The slides also show the various fields where they are most commonly used and implemented.
20+ Million Records a Second - Running Kafka on Isilon F800 (Boni Bruno)
This paper describes performance test results for running Kafka with Dell EMC Isilon F800 All-Flash NAS Storage. A comparison against direct attached storage is also provided.
This presentation discusses the benefits of merging NPM & APM together to better assist problem response teams in troubleshooting network and application problems.
The presentation highlights a new product offering called NetPod which is a joint solution developed between Emulex and Dynatrace.
This presentation has been well received by the SANS community and many information security teams I engage with.
It describes how integrating a full content repository to your existing security architecture can decrease incident response time and lead to fast identification of root cause.
I also describe a new way of implementing NetFlow without sampling to provide greater visibility of your network.
Enjoy!
Boni Bruno, CISSP, CISM, CGEIT
www.bonibruno.com
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
1. 1 of 35
Optimizing your HADOOP
Infrastructure with Hortonworks
And Dell EMC
Deep Dive into HDFS Tiering your data
across DAS and Isilon
Boni Bruno, CISSP, CISM
Chief Solutions Architect
2. 2 of 35
Growing Hadoop Data Volumes: Not all data has the same characteristics
Cold Data (Goal: Low Price/TB)
• Data in long-term retention
• Accessed/analyzed occasionally
• Ex: Long-term medical records, bank records, research data, phone logs, etc.
Hot Data (Goal: High Throughput/TB)
• Recently generated data
• Accessed/analyzed frequently
• Ex: New medical records, bank records, research data, phone logs, etc.
[Chart: TB stored over time for Hot Data vs. Hot + Cold Data, showing a widening gap against the IT budget with an inflection point around 75TB]
A significant portion of data growth in the majority of enterprises comes from "Cold Data".
Data Growth → Larger Hadoop Clusters
More Cost
• Servers
• Maintenance
• Floor Space, Power
• OS and Hadoop Software
More Risk
• Hardware Failure
• Security
• Compliance
3. 3 of 35
Expanding Hadoop from 100TB to 1PB Usable Capacity
Hadoop DAS Cluster, 100TB Usable Capacity:
• 12 R730XD Servers
• 15 RHEL Subscriptions
• 2 Racks
• 4 HDP Ent+ Subscriptions
• Server Failure Probability = 1.49%
Hadoop DAS Cluster, 1PB Usable Capacity:
• 82 R730XD Servers
• 85 RHEL Subscriptions
• 7 Racks
• 22 HDP Ent+ Subscriptions
• Server Failure Probability = 8.15%
Cold Data = Low CPU Utilization
Large Hadoop Clusters = Up to 80% Cold Data
4. 4 of 35
Hortonworks Hadoop Tiered Storage with Isilon
• Shared Storage Overlay for "Cold Data"
• Two Hadoop Namespaces: DAS and Shared Storage
• All data (DAS + Shared) analyzed by Hadoop apps
Physical Config: the DAS Cluster (namespace hdfs://DAS) holds Hot Data (< 75TB); the Isilon Shared Storage Cluster (namespace hdfs://Isilon) holds Cold Data (> 75TB).
Logical Config: Hadoop YARN-based apps (Hive, Spark and MapReduce) see Hot Data in the DAS Cluster (hdfs://DAS) and Cold Data in the Isilon Shared Storage Cluster (hdfs://Isilon), managed with Ambari, Ranger, Knox and Atlas alongside native Isilon data management tools.
More Data, Deeper Analytics, Improved Accuracy
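To make the dual-namespace idea concrete, here is a minimal sketch of how a client addresses each tier with fully qualified HDFS URIs (the das.yourdomain.com and isi.yourdomain.com host names match the examples later in this deck; the /data/hot and /data/cold paths are hypothetical):
$ hdfs dfs -ls hdfs://das.yourdomain.com:8020/data/hot
$ hdfs dfs -ls hdfs://isi.yourdomain.com:8020/data/cold
Both listings run from the same compute nodes; only the namespace in the URI decides which cluster serves the data.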
5. 5 of 35
Hortonworks Hadoop Tiered Storage Benefits
• No disruption to existing Hadoop DAS cluster
• No migration of data out of existing Isilon
• Analyze long-term data with compliance, protection, security
• HA for long-term data and the Hadoop NameNode
Minimize business risk and reduce TCO in managing explosive growth of cold data in Hadoop.
[Diagram: the existing DAS cluster (DAS NameNode/DataNodes) sits alongside the shared storage cluster; add compute for shared storage, use the Isilon HA NameNode/DataNodes, and carve out a Hadoop Access Zone on Isilon]
Grow compute slower than data. Save on: hardware, maintenance, floor space, OS software, Hadoop software.
6. 6 of 35
Example: Expanding Hadoop Cluster by 1PB Usable Capacity
Hadoop DAS Cluster:
• 82 R730XD Servers
• 85 RHEL Subscriptions
• 7 Racks
• 22 HDP Ent+ Subscriptions
• Server Failure Probability = 8.15%
• Cold Data = Low CPU Utilization
Hadoop Shared Storage (Cold Data in Isilon):
• 42 R730XD Servers
• 45 RHEL Subscriptions
• 6 Racks
• 12 HDP Ent+ Subscriptions
• 20 Isilon nodes (12 H400, 8 A200)
• Server Failure Probability = 4.12%
Savings:
• Up to 35% HW Acquisition Cost
• Up to 41% HW 3-Yr Cost
• Up to 48% SW License Cost
• Up to 30% in Rack Units
7. 7 of 35
Hadoop Tiered Storage Implementation Services
Goal: Assist customers in expanding from a Hadoop on DAS solution to the Tiered Storage Solution
1. Technology Advisory Service
2. Data Migration Service
Covered topics:
• Transition: Single Namespace → Dual Namespace
• Connect solution to business metrics and TCO
• Define and implement reusable best practices
• Best practices
• Guidelines for Cold vs. Hot Data
• Security in dual namespace environment
• Hadoop app config for dual namespace
8. 8 of 35
Implementation Services: Sample Success Stories
• Supply Chain Optimization (Global IT Provider)
• Reducing Fraud and Waste (Global Pharmaceutical Provider)
• Cell Tower Analysis (Mobile Carrier)
• Predicting and Avoiding ATM Thefts (Large South American Bank)
• Enhancing Workflow and Resource Management (Major US City Sanitation Department)
• Enabling Value-Added Customer Services (Global Engineering Firm)
Financial Services | Aviation | Healthcare | Education | Telecom | Retail | Utilities | Federal | Entertainment
10. 10 of 35
Key Solution Attributes
Supports Spark, Hive and MapReduce across Isilon and DAS namespaces
No need to configure Ambari agent on Isilon
Works in both Kerberized and non-kerberized environments
HDFS Tiering also works with Ranger
Common Hive meta-store across Isilon and DAS namespaces
High Speed HBase and Kudu are run on DAS cluster
Allows for easy separation of HDFS storage without adding compute
Provides better TCO for active archiving
11. 11 of 35
HDFS Tiering with Isilon also works with Ranger!
HDFS Tiering has been validated with HDFS, HIVE, and YARN policies enabled on DAS Cluster!
12. 12 of 35
Enable Ranger Plugin on Isilon
After creating your Ranger policies on DAS, simply enable the Ranger Plugin on Isilon.
Note: Requires OneFS 8.0.1.x or 8.1.0.x
13. 13 of 35
Example Ranger HDFS Access Policy for DAS & Isilon
Creating and assigning new access policies
1. Create sample directories such as GRANT_ACCESS and RESTRICT_ACCESS on the Isilon HDFS cluster.
2. Create hdp-user1 on all the nodes of the HDP cluster and on the Isilon cluster.
3. In the Ranger UI under the HDP3_hadoop Service Manager, assign Read/Write/Execute (RWX) access for hdp-user1 on GRANT_ACCESS.
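A hedged sketch of steps 1 and 2 from a shell (standard Hadoop and Linux commands, not taken from the deck; how you provision users, e.g. via AD/LDAP or Isilon's own user management, may differ in your environment):
$ hadoop fs -mkdir -p hdfs://isi.yourdomain.com:8020/GRANT_ACCESS
$ hadoop fs -mkdir -p hdfs://isi.yourdomain.com:8020/RESTRICT_ACCESS
$ sudo useradd hdp-user1    # repeat on each HDP node; create the matching user on Isilon with its own user management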
14. 14 of 35
Testing Ranger HDFS Access Policy with Remote Isilon File System
1. Assign RWX access to hdp-user1 in the GRANT_ACCESS directory and deny RWX access in the RESTRICT_ACCESS directory. (Shown in previous slides)
2. Access the GRANT_ACCESS directory with different hdfs commands and verify there are no permission issues:
hadoop fs -ls hdfs://isi.yourdomain.com:8020/GRANT_ACCESS
hadoop fs -put /etc/redhat-release hdfs://isi.yourdomain.com:8020/GRANT_ACCESS/
hadoop fs -ls hdfs://isi.yourdomain.com:8020/GRANT_ACCESS
Found 1 items
-rw-r--r-- 3 hdp-user1 hadoop 52 2017-08-24 12:12 hdfs://isi.yourdomain.com:8020/GRANT_ACCESS/redhat-release
SUCCESS!
15. 15 of 35
Example Ranger HDFS Deny Policy for DAS & Isilon
Creating and assigning new access policies
1. Create sample directories such as GRANT_ACCESS and RESTRICT_ACCESS on the Isilon HDFS cluster.
2. Create hdp-user1 on all the nodes of the HDP cluster and on the Isilon cluster.
3. In the Ranger UI under the HDP3_hadoop Service Manager, add a policy that denies Read/Write/Execute (RWX) access for hdp-user1 on RESTRICT_ACCESS.
16. 16 of 35
Testing Ranger HDFS Deny Policy with Remote Isilon File System
1. Add the user hdp-user1 to the Ranger RESTRICT_ACCESS directory policy on the remote Isilon HDFS.
2. Test access is restricted:
hadoop fs -put /etc/redhat-release hdfs://isi.yourdomain.com:8020/RESTRICT_ACCESS/
17/08/24 12:18:34 WARN retry.RetryInvocationHandler: Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.ranger.authorization.hadoop.exceptions.RangerAccessControlException): Permission
denied: user=hdp-user1@YOURDOMAIN.COM, access=EXECUTE, path="/RESTRICT_ACCESS"
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
at org.apache.hadoop.ipc.Client.call(Client.java:1496)
at org.apache.hadoop.ipc.Client.call(Client.java:1396)
.
.
at org.apache.hadoop.fs.FsShell.main(FsShell.java:350)
put: Permission denied: user=hdp-user1@YOURDOMAIN.COM, access=EXECUTE, path="/RESTRICT_ACCESS"
SUCCESS!
17. 17 of 35
Extensive Testing of Ranger Hive Policies with Isilon
All of the test cases below passed with HDFS tiering to Isilon, using a single metastore on the DAS cluster and Hive Ranger policies enabled.
Test case: Hive data warehouse Ranger policy setup
1. Assign RWX on the /user/hive directory for hdp-user1 on HDP (local DAS HDFS) and on the Isilon cluster using the Ranger UI
Test case: DDL operations (LOAD DATA LOCAL INPATH, INSERT INTO TABLE, INSERT OVERWRITE TABLE)
1. Drop remote database if EXISTS cascade
2. Create remote_db; hive warehouse resides on remote Isilon HDFS
3. Create internal nonpartitioned table on remote_db
4. LOAD data local inpath into table created in preceding step
5. Create internal nonpartitioned table data on remote Isilon HDFS
6. LOAD data local inpath into table created in preceding step
7. Create internal transactional table on remote_db
8. INSERT into table from internal nonpartitioned table
9. Create internal partitioned table on remote_db
10. INSERT OVERWRITE TABLE from internal nonpartitioned table
11. Create external nonpartitioned table on remote_db
12. Drop local database if EXISTS cascade
13. Create local_db; hive warehouse resides on local DAS Hadoop cluster
14. Create internal nonpartitioned table on local_db
15. LOAD data local inpath into table created in preceding step
16. Create internal nonpartitioned table data on local DAS Hadoop cluster
17. LOAD data local inpath into table created in preceding step
18. Create internal transactional table on local_db
19. INSERT into table from internal nonpartitioned table
20. Create internal partitioned table on local_db
21. INSERT OVERWRITE TABLE from internal nonpartitioned table
22. Create external nonpartitioned table on local_db
Test case: DML operations (query local and remote database tables)
1. Query data from local external nonpartitioned table
2. Query data from local internal nonpartitioned table
3. Query data from local nonpartitioned remote data table
4. Query data from local internal partitioned table
5. Query data from local internal transactional table
6. Query data from remote external nonpartitioned table
7. Query data from remote internal nonpartitioned table
8. Query data from remote nonpartitioned remote data table
18. 18 of 35
Using Hive with Hadoop Tiered Storage
The Dell EMC® Isilon® HDFS tiering solution allows for a common Hive Metastore across DAS and Isilon clusters. There is no need to maintain separate Metastores with HDFS tiering: by simply creating external databases, tables, or partitions that specify Isilon as the remote filesystem location in Hive, users can transparently access remote data on Isilon. This is a powerful use case!
Note: You still need to maintain backups of your Hive DB as a best practice!
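As an illustration of that pattern, here is a hedged sketch of an external table whose data lives on Isilon (the table name, columns, and path are hypothetical; slides 19 and 20 show the database and internal-table cases actually tested):
CREATE EXTERNAL TABLE archive_logs (log_ts STRING, host STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://isi.yourdomain.com:8020/user/hive/archive_logs';
Queries against archive_logs then run on the DAS compute nodes while reading the data from the Isilon tier.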
19. 19 of 35
Creating a remote Hive database on Isilon
Note: The Hive CLI uses HiveServer1; Hortonworks is focusing on HiveServer2 now, so start using beeline instead of the Hive CLI (both still work). Keep in mind, there is only one metastore and it's on the DAS cluster. All the Hive work is done on DAS.
Run the Hive CLI or beeline client to create a remote database location on Isilon:
> CREATE DATABASE remote_DB COMMENT 'Remote Database' LOCATION 'hdfs://isi.yourdomain.com:8020/user/hive/remote_DB';
OK
Time taken: 0.045 seconds
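The same statement through beeline, as the note above recommends; a hedged example assuming HiveServer2 listens on its default port 10000 on the DAS master:
$ beeline -u "jdbc:hive2://das.yourdomain.com:10000/default" -n hdp-user1 -e "CREATE DATABASE remote_DB COMMENT 'Remote Database' LOCATION 'hdfs://isi.yourdomain.com:8020/user/hive/remote_DB';"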
20. 20 of 35
Creating a remote Hive table on Isilon and loading data
Create an internal nonpartitioned table and load data using local inpath:
USE remote_DB;
OK
Time taken: 0.036 seconds
CREATE TABLE passwd_int_nonpart (user_name STRING, password STRING, user_id STRING,
group_id STRING, user_id_info STRING, home_dir STRING, shell STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':';
OK
Time taken: 0.211 seconds
LOAD data local inpath '/etc/passwd' into TABLE passwd_int_nonpart;
Loading data to table remote_db.passwd_int_nonpart
Table remote_db.passwd_int_nonpart stats: [numFiles=1, numRows=0, totalSize=1808, rawDataSize=0]
OK
Time taken: 0.261 seconds
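To confirm the loaded file actually landed on the remote tier, a hedged check (Hive normally places the table directory under the database LOCATION from the previous slide, so the exact path below is an assumption about your layout):
$ hdfs dfs -ls -R hdfs://isi.yourdomain.com:8020/user/hive/remote_DB/passwd_int_nonpart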
21. 21 of 35
Hive Tiering Performance with Isilon Gen5 & Gen6
In addition to the functional testing of Hive in an HDFS tiering setup, the performance of both the X410 (Gen 5) and H600 (Gen 6) was tested using a TPCDS 3TB dataset.
The 3TB dataset was first generated on the DAS cluster (6 nodes: 1 master and 5 workers, each with 40 cores, 256GB RAM, and 12 x 1TB drives excluding OS drives).
The 3TB TPCDS set was generated again on two remote Isilon clusters:
• 7 x X410 nodes (Isilon X410-4U-Dual-256GB-2x1GE-2x10GE SFP+-96TB-3277GB SSD) running OneFS 8.0.1.1
• 1 H600 chassis (4 nodes of Isilon H600-4U-Single-256GB-1x1GE-2x40GE SFP+-36TB-6554GB SSD) running OneFS 8.1.0.0
[Results of performance test on next slide]
22. 22 of 35
HDFS Tiered Storage Performance – DAS vs Gen5 vs Gen6
HDP 2.5.3.99 - HiveServer2 (No LLAP) Benchmarks
TPCDS (3TB) HDFS Tiered Storage Results (query times in seconds):
Query      Q3   Q12  Q20  Q42  Q52  Q55  Q73  Q89  Q91
das (5)    31   12   15   21   19   20   23   42   62
x410 (7)   29   14   13   22   22   21   30   59   64
h600 (4)   30   13   12   19   15   15   18   37   59
All queries ran from the same DAS compute nodes! The difference is where the data resides (local disks or on Isilon). Here you see the H600 (with fewer nodes) outperforming DAS!
23. 23 of 35
Moving Data with DistCp (use -skipcrccheck with Isilon)
Moving data between DAS and Isilon with DistCp works fine (both Non-Kerberos and Kerberos tested).
Note: As with any Hadoop cluster, data must not be active on source when using DistCp to move data.
Example below uses DistCp to copy a file from the local DAS cluster to the remote Isilon cluster.
$ hadoop distcp -skipcrccheck -update /tmp/redhat-release hdfs://isi.yourdomain.com/tmp/
$ hdfs dfs -ls -R hdfs://isi.yourdomain.com:8020/tmp/redhat-release
-rw-r--r-- 3 root hdfs 27 2017-09-07 13:26 hdfs://isilon.solarch.lab.emc.com:8020/tmp/redhat-release
Now from the remote Isilon cluster to the local DAS cluster.
$ hadoop distcp -pc hdfs://isi.yourdomain.com:8020/tmp/redhat-release
hdfs://das.yourdomain.com:8020/tmp/
$ hdfs dfs -ls -R hdfs://das.yourdomain.com:8020/tmp/redhat-release
-rw-r--r-- 3 hdfs hdfs 27 2017-09-07 13:26 hdfs://hdp-master.bigdata.emc.local:8020/tmp/redhat-release
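For larger cold-data moves, a hedged variant of the same approach copies an entire directory with an explicit mapper count (-m, -update and -skipcrccheck are standard DistCp options; the warehouse path is a placeholder):
$ hadoop distcp -skipcrccheck -update -m 20 hdfs://das.yourdomain.com:8020/apps/hive/warehouse/cold_db.db hdfs://isi.yourdomain.com:8020/apps/hive/warehouse/cold_db.db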
24. 24 of 35
Using MapReduce with HDFS Tiering
You can run MapReduce jobs using Isilon as a source or destination; just specify the HDFS path in the MR job.
Let's run a MapReduce WordCount job using the local DAS cluster as the source input and the remote Isilon cluster as the destination output, as an example:
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
hdfs://das.yourdomain.com/tmp/mr/redhat-release hdfs://isi.yourdomain.com/tmp/mr/redhat-release-das
$ hdfs dfs -ls -R hdfs://isi.yourdomain.com/tmp/mr/redhat-release-das
-rw-r--r-- 3 ambari-qa hdfs 0 2017-08-04 01:49 hdfs://isi.yourdomain.com/tmp/mr/redhat-release-das/_SUCCESS
-rw-r--r-- 3 ambari-qa hdfs 68 2017-08-04 01:49 hdfs://isi-cluster-hdfs2.bigdata.emc.local:8020/tmp/mr/redhat-release-das-hdfs2/part-r-00000
Now the reverse, WordCount job on input from the remote Isilon cluster with output going to the local DAS cluster:
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
hdfs://isi.yourdomain.com/tmp/mr/redhat-release hdfs://das.yourdomain.com/tmp/mr/redhat-release-isi
$ hdfs dfs -ls -R hdfs://das.yourdomain.com/tmp/mr/redhat-release-isi
-rw-r--r-- 3 ambari-qa hdfs 0 2017-08-04 09:50 hdfs://hdp-master03.bigdata.emc.local:8020/tmp/mr/redhat-release-isi/_SUCCESS
-rw-r--r-- 3 ambari-qa hdfs 68 2017-08-04 09:50 hdfs://hdp-master03.bigdata.emc.local:8020/tmp/mr/redhat-release-isi/part-r-00000
25. 25 of 35
Using Spark with HDFS Tiering
Let’s create a word count and line count Scala file for Spark testing:
cat >/tmp/spark_line_word_count.scala <<'EOF'
// args(0) = input path, args(1) = output path prefix (passed in via spark.driver.args)
val args=sc.getConf.get("spark.driver.args").split("\\s+")
var input=args(0)
var output1=args(1) + "-wc"
var text_file=sc.textFile(input)
// word count: split each line on spaces, emit (word, 1), reduce by key
val word_count=text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
word_count.saveAsTextFile(output1)
var output2=args(1) + "-lc"
// line count written out as a single-element RDD
var line_count=sc.parallelize(Seq(text_file.count()))
line_count.saveAsTextFile(output2)
exit
EOF
26. 26 of 35
Using Spark with HDFS Tiering - continued
Using the file we just created, we can run a Spark shell to do a word count and line count
(as an example) on input from the local DAS cluster with output going to the remote Isilon
cluster and vice-versa. Commands shown below:
$ spark-shell -i /tmp/spark_line_word_count.scala --conf
'spark.driver.args=hdfs://das.yourdomain.com/tmp/spark/redhat-release
hdfs://isi.yourdomain.com/tmp/spark/redhat-release-das'
$ spark-shell -i /tmp/spark_line_word_count.scala --conf
'spark.driver.args=hdfs://isi.yourdomain.com/tmp/spark/redhat-release-das
hdfs://das.yourdomain.com/tmp/spark/redhat-release-isilon'
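A hedged way to confirm both runs wrote where expected (the -wc and -lc suffixes come from the script above; part-00000 is the usual saveAsTextFile output file name):
$ hdfs dfs -ls hdfs://isi.yourdomain.com/tmp/spark/redhat-release-das-wc
$ hdfs dfs -cat hdfs://das.yourdomain.com/tmp/spark/redhat-release-isilon-lc/part-00000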
27. 27 of 35
Next Steps
1. Assess your data and workloads
2. Solution white-boarding and proof-point discussion with subject matter expert
3. Discuss Cost and TCO Analysis
4. Discuss Implementation Services
Improve Outcomes | Control Costs | Minimize Risk