As originally designed, the Hadoop framework is typically built on a native environment with commodity hardware. However, with the growing trend toward cloud computing, there is a stronger requirement to build Hadoop clusters on a public or private cloud so that customers can benefit from virtualization and multi-tenancy. This talk introduces some of the challenges of providing a Hadoop service on a virtualization platform (performance, rack awareness, job scheduling, memory overcommitment, etc.) and proposes some solutions.
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & Isilon (EMC)
Hadoop has made it into the enterprise mainstream as a Big Data technology. But what about Hadoop as a private or public cloud service on shared infrastructure? This session looks at a Hadoop solution with virtualization, shared storage, and multi-tenancy, and discusses how service providers can use Pivotal Hadoop Distribution, Isilon, and Serengeti to offer Hadoop-as-a-Service.
After this session you will be able to:
Objective 1: Understand Hadoop and its deployment challenges.
Objective 2: Understand the EMC HDaaS solution architecture and the use cases it addresses.
Objective 3: Understand Pivotal Hadoop Distribution, Serengeti, and Isilon's Hadoop features.
Hadoop as a Service (as offered by a handful of niche vendors today) is a cloud computing solution that makes medium- and large-scale data processing accessible, easy, fast, and inexpensive. This is achieved by eliminating the operational challenges of running Hadoop, so one can focus on business growth.
Deciding on the deployment model is critical when enterprises adopt Hadoop. Initially, the bare-metal model (an on-premises cluster of physical servers) was popular, to avoid I/O overhead in virtualized environments. These days, however, the cloud is also a contending option, with compelling cost savings and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, and tuned and compared the performance of bare-metal Hadoop clusters and a Hadoop cloud service. Interestingly, the study discovered that the price/performance ratio is not a critical factor in making a Hadoop deployment decision: employing empirical and systematic analyses, it found comparable price/performance from both bare-metal Hadoop clusters and Hadoop-as-a-Service. Moreover, cheaper purchasing options (e.g., long-term contracts) provide a better ratio than bare metal in many cases. This result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads because of their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that cloud infrastructure enables even further tuning opportunities.
The TCO Calculator - Estimate the True Cost of Hadoop (MapR Technologies)
http://bit.ly/1wsAuRS - There are many hidden costs for Apache Hadoop, and they affect different Hadoop distributions differently. With the new MapR TCO calculator, organisations have a simple, reliable, fact-based tool to compare costs.
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case?
This presentation attempts to compare the different options available on AWS.
Part 2: Cloudera's Operational Database: Unlocking New Benefits in the Cloud (Cloudera, Inc.)
3 Things to Learn About:
*On-premises versus the cloud
*Design & benefits of real-time operational data in the cloud
*Best practices and architectural considerations
Intel and Cloudera: Accelerating Enterprise Big Data Success (Cloudera, Inc.)
The data center has gone through several inflection points in the past decades: adoption of Linux, migration from physical infrastructure to virtualization and Cloud, and now large-scale data analytics with Big Data and Hadoop.
Please join us to learn about how Cloudera and Intel are jointly innovating through open source software to enable Hadoop to run best on IA (Intel Architecture) and to foster the evolution of a vibrant Big Data ecosystem.
Part 1: Lambda Architectures: Simplified by Apache Kudu (Cloudera, Inc.)
3 Things to Learn About:
* The concept of lambda architectures
* The Hadoop ecosystem components involved in lambda architectures
* The advantages and disadvantages of lambda architectures
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environments (DataWorks Summit)
The scheduler of a container orchestration system such as YARN or K8s is a critical component that users rely on to plan resources and manage applications.
Assessing where we are today: YARN effectively has two powerful schedulers (the Fair and Capacity schedulers), and both serve many strong use cases in the big data ecosystem. YARN can scale up to 50k nodes per cluster, schedule 20k containers per second, and is extremely efficient at managing batch workloads.
The K8s default scheduler is an industry-proven solution for efficiently managing long-running services. But as more big data apps move to K8s and the cloud, many features (hierarchical queues for better multi-tenancy, fair resource sharing, preemption, etc.) are either missing or not yet mature enough to support big data apps running on K8s.
At this point, no solution exists that provides a unified resource scheduling experience across platforms. That makes it extremely difficult to manage workloads running in different environments, from on-premises to cloud.
Hence, evolving a common scheduler that builds on the proven capabilities of YARN and K8s, and improving it for the cloud, will focus on use cases like:
Better bin-packing scheduling (and gang scheduling)
Autoscale up and shrink policy management
Effectively running batch workloads and services with clear SLAs
In summary, as a separate initiative we are improving core, cloud-aware scheduling capabilities to manage both K8s and YARN clusters; the above use cases will be its core focus. More details of our work will be presented in this talk.
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM (Amazon Web Services)
Are you ready to streamline your SAP processes?
The pace of change in business and IT is relentless. To stay on top, enterprise leaders must focus on agility. Yet, customizing and upgrading SAP environments can be time consuming and costly. Now SAP teams can innovate faster and more cost effectively while also maximizing availability by using SAP Landscape Virtualization Manager with NetApp Private Storage for AWS & NetApp Cloud ONTAP.
SAP and NetApp will show how to dramatically accelerate SAP development cycles by instantly activating the compute and storage resources needed for dev, test, training and conference room pilots. We’ll also show how this automated solution can reduce compute and storage costs by up to 50% and provide disaster recovery virtually for free. The solution demonstration will include how to clone multiple copies of an SAP system in 5 minutes using no extra storage. Along the way we’ll share cloud-readiness advice gained in jointly developing this solution.
The NetApp team will then review the NetApp-AWS solutions at the foundation of this SAP infrastructure and share new capabilities in our expanding joint portfolio that can help you:
• Gain cloud compute benefits for workloads that demand maximum performance, scale, availability, or control: NetApp Private Storage for AWS – uses Amazon EC2
• Enhance AWS cloud storage with the power of enterprise data management — with a software-only solution: NetApp Cloud ONTAP for AWS – uses Amazon EC2 & Amazon EBS
• Solve backup and archive headaches with cloud-integrated storage: NetApp SteelStore – uses Amazon S3 & Amazon Glacier.
• Store data across Amazon regions & decades with scalable, durable object storage across your premises and AWS: NetApp StorageGRID Webscale – uses Amazon S3
Doug Cutting discusses:
- A brief history of Spark and its rise in popularity across developers and enterprises
- Spark's advantages over MapReduce
- The One Platform Initiative and the roadmap for Spark
- The future of data processing in Hadoop
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration (EDB)
Big Data. Data Science. AI. It's all big business.
Once upon a time we succeeded in these fields by selectively storing, processing and learning from just the right data. This, of course, requires you to know what "the right data" is. We know there are valuable insights in data, so why not store the lot? It's the 21st century equivalent of "there's gold in them thar hills!"
So having spent years stashing away terabytes of your data in PostgreSQL, you want to start learning from that data. Queries. More queries. More complex queries. Lots of real-time queries. Lots of concurrent users. It might be tempting at this point to give up on PostgreSQL and stash your data into a different solution, more suited to purpose. Don't. PostgreSQL can perform very well in HTAP environments and performs even better with a little help.
In this presentation we dive into the current state of the art with regards to PostgreSQL in HTAP environments and expose how hardware acceleration can help squeeze as much knowledge as possible out of your data.
Challenges for running Hadoop on AWS - Advanced (AWS Meetup, Andrei Savu)
Nowadays we have all the tools we need to spin up and tear down clusters of hundreds of nodes in minutes, and this puts more pressure on the tools we use to configure and monitor our applications. The challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters in AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention, and noisy neighbors), and touch on the future and how we should go about decoupling workload dispatch from cluster lifecycle.
It’s becoming clear that enterprises need more than one cloud. Hybrid enables enterprises to optimize how their business works – public cloud for elasticity and scale, multi-cloud for redundancy and choice, and on-premises for performance and privacy. Cloudera delivers a hybrid cloud solution that works where enterprises work, with the agility, security and governance enterprise IT needs, and the self-service analytics business people and enterprise data professionals demand. In this session, we will talk about how Cloudera helps deliver hybrid solutions for enterprises and will run a hands-on Cloudera PaaS demo to exhibit:
- Altus Environment Setup
- Configure Altus SDX
- Spin-up transient clusters with Altus
- Execute workload on Altus Data Engineering clusters
- Run interactive queries on object store with Altus Data Warehouse
- Job Analytics with Workload Experience Manager (WXM)
Speaker: Junaid Rao, Senior Cloud Sales Engineer, Cloudera
Time and again, research shows organisations are held back in their digital transformation because of a lack of skills. A recent IDC survey shows it's the case for nearly half of all organisations when it comes to specialist big data and data science skills. As an organisation, how do you know you're hiring the right people to close the gap? As an individual, how do you prove you know what you're doing?
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces (Cloudera, Inc.)
You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
But there can be pitfalls: dplyr works differently with different data sources—and those differences can bite you if you don’t know what you’re doing.
Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about sparklyr (from RStudio) and the package implyr (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. And, he will solve mysteries:
Do I need to know SQL to use dplyr?
When is a “tbl” not a “tibble”?
Why is 1 not always equal to 1?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
3 things to learn:
Do I need to know SQL to use dplyr?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the controller's capabilities and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and Java APIs, demonstrating how the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
When your databases support mission-critical applications, latency and outages can hurt your business. That’s why you need monitoring and management tools to help you keep your enterprise Postgres servers — and the applications they support — consistently available and consistently fast. In this webinar, you’ll learn:
The various tasks — monitoring, administration, etc. — required to keep a database server working well, and how they differ
Why it’s hard to monitor databases with general-purpose monitoring tools
The main tools available for enterprise Postgres needs
How these solutions differ, and when and why to choose each of them for specific cases
By the end of this session, you will have an understanding of how to avoid downtime and optimize the user experience with database monitoring tools.
This presentation will discuss best practices for designing and building a solid, robust, and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotion, increased stabilization of the entire software stack, High Availability, and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand the design, configuration, and deployment of Hadoop in virtual infrastructures.
Big Data and virtualization are two of the most exciting trends in the industry today. In this session you will learn about the components of Big Data systems, and how real-time, interactive, and distributed processing systems like Hadoop integrate with existing applications and databases. The combination of Big Data systems with virtualization gives Hadoop and other Big Data technologies the key benefits of cloud computing: elasticity, multi-tenancy, and high availability. A new open source project that VMware will announce at the Hadoop Summit will make it easy to deploy, configure, and manage Hadoop on a virtualized infrastructure. We will discuss reference architectures for key Hadoop distributions and the future directions of this new open source project.
Introduction to cloud computing data center and network issues, from the Internet Research Lab at NTU, Taiwan. Offers a definition of cloud computing and a comparison of the traditional IT warehouse with the modern cloud data center (PPT slides available for download). Takes an open-source data center management OS, OpenStack, as an example, and covers the underlying network issues inside a cloud DC.
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.
Virtual machines are a mainstay in the enterprise, while Apache Hadoop is normally run on bare-metal machines. This talk walks through the convergence of the two and the use of virtual machines for running Apache Hadoop. We describe the results from various tests and benchmarks, which show that the overhead of using VMs is small: a small price to pay for the advantages offered by virtualization. The second half of the talk compares multi-tenancy with VMs against multi-tenancy with Hadoop's Capacity Scheduler. We follow with a comparison of resource management in vSphere and the finer-grained resource management and scheduling in NextGen MapReduce, which supports a general notion of a container (such as a process, JVM, or virtual machine) in which tasks are run. We compare the role of such first-class VM support in Hadoop.
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Management... (VMworld)
VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London (Hentsū)
Slides from our recent workshop for hedge funds, with a review of cloud grid computing options. Includes live demos tackling 2 TB of full-depth market data using MATLAB on AWS, and Google BigQuery with Datalab.
2. Cloud: Big Shifts in Simplification and Optimization
1. Reduce the Complexity: simplify operations and maintenance
2. Dramatically Lower Costs: redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery: meet and anticipate the needs of the business
3. Infrastructure, Apps and now Data…
[Diagram: build, run, and manage across private and public clouds]
Simplify Infrastructure with Cloud; Simplify App Platform through PaaS
Next Trend: Simplify Data
4. Trend 1/3: New Data Growing at 60% Y/Y
Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation…
[Chart of data sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies]
Source: The Information Explosion, 2009
6. Trend 3/3: Value from Data Exceeds Hardware Cost
Value from the intelligence of data analytics now outstrips the cost of hardware.
• Hadoop enables the use of lower cost hardware
• Hardware cost halving every 18 months
[Chart: value vs. cost, big iron at $40k/CPU vs. a commodity cluster at $1k/CPU]
7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
The trend is "not just Hadoop" for big data
• Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
• Unify the infrastructure platform for all
[Diagram: separate SQL, NoSQL, Hadoop, and DSS clusters consolidated onto a unified big data infrastructure (private or public) with a common hardware base]
• Eliminate the hardware/driver/testing phase
• Use the existing team for ordering, diagnosis, and capacity management of the hardware farm
8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
I WANT MY HADOOP CLUSTER NOW!
Instant Cluster Provisioning
• Provision Hadoop clusters instantly
• Automatable using provisioning engines/scripts, e.g. Whirr
9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
Increase Utilization
• Hadoop cluster only uses resources it needs
• Extra resources can be used by other applications when not in use
Eliminate single points of failure
• Use vSphere HA for Namenode and Jobtracker
Use VM Isolation
• Create separate clusters with defensible security
• Enables multiple versions of Hadoop on the same infrastructure
• Extends to Hadoop and Linux Environments
Leverage Resource Management
• Control/assign resources through resource pools
• E.g. Use spare cycles for Hadoop Processing through priority control
10. What? Hadoop in a VM? Really?
Actually, Hadoop performs well in a virtual machine
13. Hadoop Configuration
Distribution
• Cloudera CDH3u0
• Based on Apache open-source 0.20.2
Parameters
• dfs.datanode.max.xcievers=4096
• dfs.replication=2
• dfs.block.size=134217728
• io.file.buffer.size=131072
• mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
• mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
Network topology
• Hadoop uses topology info for reliability and performance
• Multiple VMs per host: each host is a "rack"
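To make the parameter list concrete, here is a minimal sketch of writing the HDFS values above into the cluster's config file; HADOOP_CONF_DIR is assumed to point at the active configuration directory, and the mapred.child.java.opts and io.file.buffer.size settings would go into mapred-site.xml and core-site.xml in the same way:

    # Sketch: persist the HDFS tuning values listed above (CDH3-era property names).
    # Assumes HADOOP_CONF_DIR points at the active configuration directory.
    cat > "$HADOOP_CONF_DIR/hdfs-site.xml" <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
      <property><name>dfs.replication</name><value>2</value></property>
      <property><name>dfs.block.size</name><value>134217728</value></property>
    </configuration>
    EOF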
14. Benchmarks
Derived from test apps included in the distro
Pi
• Direct-exec Monte-Carlo estimation of pi: pi ≈ 4*R/(R+G) (≈ 22/7)
• # map tasks = # logical processors
• 1.68 T samples
TestDFSIO
• Streaming write and read
• 1 TB
• More tasks than processors
Terasort
• 3 phases: teragen, terasort, teravalidate
• 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
• More tasks than processors
• Exercises CPU, networking, and storage I/O
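For reference, a sketch of how these three benchmarks are typically invoked from the example jars shipped with the distribution; the jar paths are the usual CDH3 locations and the argument values are illustrative, not the exact sizes used in these tests:

    EXAMPLES=/usr/lib/hadoop/hadoop-examples.jar   # location varies by distro
    TESTS=/usr/lib/hadoop/hadoop-test.jar

    hadoop jar "$EXAMPLES" pi 16 1000000                    # Monte-Carlo pi: <maps> <samples per map>
    hadoop jar "$TESTS" TestDFSIO -write -nrFiles 100 -fileSize 10000   # streaming write, 100 x 10 GB = 1 TB
    hadoop jar "$TESTS" TestDFSIO -read  -nrFiles 100 -fileSize 10000   # streaming read of the same files
    hadoop jar "$EXAMPLES" teragen 10000000000 /tera/in     # 10B x 100-byte records = 1 TB
    hadoop jar "$EXAMPLES" terasort /tera/in /tera/out
    hadoop jar "$EXAMPLES" teravalidate /tera/out /tera/report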
15. Performance of Hadoop for Several Workloads
[Bar chart: ratio of time taken relative to native, lower is better (y-axis 0 to 1.2), with series for 1 VM and 2 VMs]
16. Architecting Hadoop as a Service using Virtualization
Goals
• Make it fast and easy to provision new Hadoop clusters on demand
• Leverage virtual machines to provide isolation (esp. for multi-tenant)
• Optimize Hadoop's performance based on virtual topologies
• Make the system reliable based on virtual topologies
Leveraging Virtualization
• Elastic scale in/out
• Use high availability to protect the namenode/job tracker
• Resource controls and sharing: re-use underutilized memory and CPU
• Prioritize workloads: limit or guarantee resource usage in a mixed environment
17. Provisioning
Leverage the vSphere APIs to auto-deploy a cluster
• Whirr, HOD, or custom using Ruby, Chef, etc.
Use linked clones to rapidly fork many nodes
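As an illustration of the Whirr route, a minimal sketch; the cluster name, instance counts, and provider are made-up example values, and the credentials are taken from the environment:

    # Sketch: provision a small Hadoop cluster with Apache Whirr.
    cat > hadoop.properties <<'EOF'
    whirr.cluster-name=demo-hadoop
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    EOF
    whirr launch-cluster --config hadoop.properties    # bring the cluster up
    whirr destroy-cluster --config hadoop.properties   # tear it down when done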
18. Fast Provisioning
From a "seed" node to a cluster
• Thin provisioning: 60 GB => 3.5 GB
• Linked clone: ~6 seconds
19. SAN, NAS or Local Disk?
Shared Storage (SAN or NAS)
• Easy to provision
• Automated cluster rebalancing
Hybrid Storage
• SAN for boot images, VMs, and other workloads
• Local disk for HDFS
• Scalable bandwidth, lower cost/GB
[Diagram: hosts running Hadoop VMs alongside other VMs, with HDFS on local disks and the other workloads on shared storage]
20. Enable Automatic Rack Awareness through vSphere
• Important for a robust Hadoop cluster
• Automatic network topology detection, an important vSphere feature
• The rack script is generated automatically
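For context, a hand-written sketch of the kind of topology script such tooling generates. Hadoop invokes the script named by topology.script.file.name with node addresses as arguments and expects one rack path per output line; the addresses and host assignments below are illustrative:

    #!/bin/bash
    # Map each DataNode address to the physical host ("rack") its VM runs on.
    declare -A RACK=(
      [10.0.0.11]=/esx-host-01  [10.0.0.12]=/esx-host-01
      [10.0.0.21]=/esx-host-02  [10.0.0.22]=/esx-host-02
    )
    for node in "$@"; do
      echo "${RACK[$node]:-/default-rack}"   # unknown nodes fall back to a default rack
    done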
21. Multi-tenant: Share a Cluster or Not?
Shared big cluster: high performance, large scale, pre-job provisioning
vs. isolated small clusters: secure, flexible, post-job provisioning
Use a combination, as customers' requirements differ
22. Elastic Hadoop Cluster
Traditional Hadoop cluster
• Easy to scale out: fast-provision new Hadoop nodes and join them into the existing cluster
• Hard to scale in:
    while (cluster is too large) {
      choose node k;
      kill node k;
      wait until k's data blocks are recovered;
      if necessary, rebalance the cluster;
    }
Elastic Hadoop cluster
[Diagram: NameNode and JobTracker alongside normal nodes (TaskTracker + DataNode) and elastic nodes (TaskTracker only)]
23. Replica Placement
Second replica
• Different rack
• Rack awareness required
Third replica
• Same rack, different physical host
• Nodes share a host (in a virtualized environment)
25. Performance
Create more, smaller VMs
• Makes Hadoop scale better
• Allows easier/faster adjustment of VM packing across hosts by vSphere (including through DRS)
Sizing/configuration of storage is critical
• Plan on ~50 MB/s of storage bandwidth per core
• SANs are typically configured by default for IOPS, not bandwidth
• Ensure SAN ports/switch topology allow the required aggregate bandwidth
• Performance of the backend storage should be tested/sized
• Local disks will give ~100-140 MB/s per disk: pick the correct controller
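To make the sizing rule concrete, a back-of-envelope check using only the figures above (the 16-core host is an assumed example):

    16 cores/host x 50 MB/s per core     ≈ 800 MB/s of storage bandwidth per host
    800 MB/s / ~120 MB/s per local disk  ≈ 7 local disks per host (or equivalent SAN bandwidth)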
26. Summary
Hadoop does work well in a virtual environment
Plan a virtual cluster, and enable other big-data solutions on the same infrastructure
Leverage the recipes to automate your configuration and deployment