Sahara provides a way to deploy and manage Hadoop clusters within an OpenStack cloud. It addresses common customer needs like providing an elastic environment for data processing jobs, integrating Hadoop with the existing private cloud infrastructure, and reducing costs. Key challenges include speeding up cluster provisioning times, supporting complex data workflows, optimizing storage architectures, and improving performance when using remote object storage.
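To make "deploy and manage Hadoop clusters" concrete, here is a minimal Python sketch of the request a client sends Sahara to provision a cluster. The endpoint shape and field names follow the Sahara v1.1 REST API; the helper function and every value are illustrative placeholders, not a tested deployment.

```python
# Illustrative sketch only: field names follow the Sahara v1.1 REST API,
# but the helper and all values are placeholders.

def cluster_request(name, plugin, version, template_id, image_id):
    """Build the JSON body for POST /v1.1/{project_id}/clusters."""
    return {
        "name": name,
        "plugin_name": plugin,            # e.g. "vanilla" or "cdh"
        "hadoop_version": version,
        "cluster_template_id": template_id,
        "default_image_id": image_id,
    }

body = cluster_request("demo-cluster", "vanilla", "2.7.1",
                       "<cluster-template-uuid>", "<image-uuid>")
# A real client would then send it with an auth token, e.g.:
#   requests.post(f"{sahara_url}/v1.1/{project_id}/clusters",
#                 json=body, headers={"X-Auth-Token": token})
print(body["name"])  # demo-cluster
```

Sahara fills in the rest (node groups, networking, Hadoop configuration) from the referenced cluster template, which is what makes the elastic, on-demand provisioning described above possible.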
There's a big shift at both the architecture and API levels from Hadoop 1 to Hadoop 2, particularly with YARN, and we held our first meetup to discuss this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage (Alluxio, Inc.)
Alluxio Tech Talk
Jul 17, 2019
Speakers:
Brien Porter, Intel
Alex Ma, Alluxio
The ever-increasing challenge of processing and extracting value from exploding data with AI and analytics workloads makes a memory-centric architecture with disaggregated storage and compute more attractive. This decoupled architecture enables users to innovate faster and scale on demand. Enterprises are also increasingly looking toward object stores to power their big data and machine learning workloads cost-effectively. However, object stores provide neither big-data-compatible APIs nor the required performance.
In this webinar, the Intel and Alluxio teams will present a proposed reference architecture using Alluxio as the in-memory accelerator for object stores to enable modern analytical workloads such as Spark, Presto, TensorFlow, and Hive. We will also present a technical overview of Alluxio.
My presentation for the first user group meeting of our lab's Big Data IWT TETRA project [*]. In the presentation, I gave a demo of Cloudera Manager, discussed four micro-benchmarks, and concluded with an overview of the Big Bench benchmark.
[*] For more information on what IWT TETRA funding exactly is, see http://www.iwt.be/english/funding/subsidy/tetra
Design, Scale and Performance of MapR's Distribution for Hadoop (mcsrivas)
Details the first Exabyte-scale system that can hold a trillion large files. Describes MapR's Distributed NameNode(tm) architecture and how it scales easily and seamlessly. Shows MapReduce performance across a variety of benchmarks, including DFSIO, PigMix, NNBench, TeraSort, and YCSB.
Overview of the architecture and benefits of Dell HPC Storage with Intel EE Lustre in High Performance Computing and Big Science workloads.
Presented by Andrew Underwood at the Melbourne Big Data User Group - January 2016.
Lustre is a trademark of Seagate Technology.
Update your private cloud with 14th generation Dell EMC PowerEdge FC640 serve... (Principled Technologies)
Critical Apache Cassandra NoSQL databases can offer reliability and flexibility for workloads like media streaming or social media. Running these databases in a private cloud can let you maintain control of your data while giving you the agility and flexibility the cloud provides.
In our datacenter, the Dell EMC PowerEdge FC640 solution powered by Intel Xeon Gold 5120 processors dramatically increased performance for Apache Cassandra workloads compared to a legacy solution. By choosing a solution that can do up to 4.7 times the work of the legacy solution, your infrastructure could handle more requests at a time—and we found that the Dell EMC PowerEdge FC640 solution could do all this additional work in less space, which could let you hold off on renting more datacenter space or on building out your existing space as your business grows.
The slides were created for the "Hadoop User Group Vienna", a Meetup that gathered Hadoop users in Vienna on September 6, 2017. The content corresponds to the first talk, which discussed concepts, terminology, and disaster recovery capabilities in the Hadoop ecosystem.
Architectural Overview of MapR's Apache Hadoop Distribution (mcsrivas)
Describes the thinking behind MapR's architecture. MapR's Hadoop achieves better reliability on commodity hardware than anything else on the planet, including custom, proprietary hardware from other vendors. Apache HDFS and Cassandra replication are also discussed, as are SAN and NAS storage systems such as NetApp and EMC.
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM (Yahoo! Developer Network)
Slides from LINE Developer Meetup #68 - Big Data Platform, covering the HDFS major version upgrade and the adoption of Router-based Federation (RBF). Event page: https://line.connpass.com/event/188176/
Apache Hadoop is a framework for distributed computation and storage of very large data sets on computer clusters. Hadoop began as a project to implement Google's MapReduce programming model and has become synonymous with a rich ecosystem of related technologies, including Apache Pig, Apache Hive, Apache Spark, Apache HBase, and others.
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed... (inside-BigData.com)
In this deck from the LAD'14 Conference in Reims, Rekha Singhal from Tata Consultancy Services presents: Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapReduce Application.
Learn more: http://insidehpc.com/lad14-video-gallery/
Watch the video presentation: http://inside-bigdata.com/2014/09/29/performance-comparison-intel-enterprise-edition-lustre-hdfs-mapreduce/
Big data processing meets non-volatile memory: opportunities and challenges (DataWorks Summit)
Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of remote direct memory access (RDMA) over InfiniBand and RoCE. However, with the introduction of non-volatile memory (NVM), these designs, along with the default execution models such as MapReduce and Directed Acyclic Graph (DAG), need to be re-assessed to discover the possibilities of further enhanced performance.
In this context, we propose an accelerated execution framework (NVMD) for MapReduce and DAG that leverages the benefits of NVM and RDMA. NVMD introduces novel features for MapReduce and DAG, such as a hybrid push and pull shuffle mechanism and dynamic adaptation to the network congestion. The design has been incorporated into Apache Hadoop and Tez. Performance results illustrate that NVMD can achieve up to 3.65x and 3.18x improvement for Hadoop and Tez, respectively. In this talk, we will also present NVM-aware HDFS design and its benefits for MapReduce, Spark, and HBase.
Speaker: Shashank Gugnani, PhD Student, Ohio State University
The Importance of Fast, Scalable Storage for Today's HPC (Intel IT Center)
Today, data drives discovery. And discoveries are key to creating sustained advantages. The better your critical workflows can create and access data, the better you'll be able to discover new, innovative solutions to important problems or create entirely new products. More than ever before, data-intensive applications need the sustained performance and virtually unlimited scalability that only parallel storage software delivers.
Designed for maximum performance and scale, storage solutions powered by Lustre software deliver the performance at scale to meet today’s storage requirements. As the most widely used parallel storage system for HPC, Lustre-powered storage is the ideal storage foundation.
But scalable, high-performance storage by itself only solves half the problem. Today's users expect storage solutions that deliver sustained performance, scale upward to near-limitless capacities, and are simple to install and manage. Intel(r) Enterprise Edition for Lustre* software combines the straight-line speed and scale of Lustre with the bottom-line need for lower management complexity and cost.
As a recognized leader in the development and support of the Lustre file system, Intel has the expertise to make storage solutions for data-intensive applications faster, smarter, and easier.
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; these slides are posted for those who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
These are the slides from our recent HadoopIsrael meetup, dedicated to a comparison of the Spark and Tez frameworks.
At the end of the meetup there is a small update about our ImpalaToGo project.
Support Iran's representative in the WSIS Prizes 2016 ... Thank you ... (Leila Esmaeili)
The World Summit on the Information Society (WSIS) is held annually by the International Telecommunication Union (ITU), with the support of the United Nations, on the theme of information and communication technology and the information society. This global summit, attended by the world's ministers of telecommunications and ICT, pursues eighteen main action lines intended to encourage countries to develop and advance their information and communication technologies along those lines.
Among the most important of these action lines are applications of ICT in e-learning and e-science. The integrated virtual education solution (ELIS) and the Kowsar-Net network were developed along these lines and are now competing against international rivals as Iran's representatives for the WSIS Prize. Having passed the international evaluation and judging stage, these two projects are nominated for the best-project award in the e-learning and e-science categories and are now in the public voting stage.
This message is therefore sent in support of our country's representative in this international arena. Given the mission of these two projects and their scientific, cultural, and Islamic character, you are invited to vote for and declare your support for the named projects (categories 9 and 14), and to invite other friends, colleagues, and specialists to support the integrated virtual education solution (ELIS) and the Kowsar-Net network. Your vote for projects in the other categories is also valuable.
Registration and voting site:
http://groups.itu.int/stocktaking/WSISPrizes/WSISPrizes2016/Voting.aspx
Voting deadline: 20 Esfand 1394 (Iranian calendar)
For more information, follow our Telegram channel: @wsisprizeswhc
Hadoop is often viewed as needing racks of dedicated boxes, despite the fact that in sheer number terms, the majority of Hadoop clusters ever created have been brought up on public cloud infrastructures, particularly Amazon's. Yet the rest of datacenter computing is moving towards virtualization, be it in-cloud startups or in-enterprise IT departments. Some organizations are standing up private clouds: a rack or two of servers with an API for VM creation. Hadoop can live there; it just needs to integrate better. At the same time, OpenStack is emerging as the de facto standard open source cloud platform for private use, and is available publicly from a number of cloud infrastructure service providers. This talk looks at what we've done, and are doing, to integrate Hadoop with OpenStack. This takes it beyond Hadoop's current support for Amazon's infrastructure, making a combined Hadoop + OpenStack cluster something to consider in-house and in-cloud. This work is being done in collaboration with members of the OpenStack community, showing how cloud and big data projects can not only co-exist, but co-develop their platforms.
Slides from the hands-on workshop on data security and risk management in cloud computing, 20 Aban, Sharif University.
Includes plain-language examples of cloud computing and the results of a case study on risk analysis of cloud fax.
The Evolution of OpenStack – From Infancy to Enterprise (Rackspace)
As OpenStack turns 5 this year, we thought it would be a good time to take a look back at the evolution of OpenStack. We start with a quick overview of what OpenStack is, how OpenStack came to be, and describe the OpenStack Foundation. Next we describe the problem that OpenStack helps to solve, the components of OpenStack, and the timeline for when these components came to be. Last, we outline the current features and benefits that make OpenStack ready for the enterprise, with supporting enterprise use case examples. The blog can be found here (https://developer.rackspace.com/blog/evolution-of-openstack-from-infancy-to-enterprise/) and the webinar can be found here (https://www.brighttalk.com/webcast/11427/138613).
The massive computing and storage resources that are needed to support big data applications make cloud environments an ideal fit. In this session, you'll learn how to build your big data "database on-demand" using MongoDB, Cassandra, Solr, MySQL, or any other big data solution, as well as manage your big data application using a new open source framework called “Cloudify.” All this, on top of the OpenStack cloud.
From limited Hadoop compute capacity to increased data scientist efficiency (Alluxio, Inc.)
Alluxio Tech Talk
Oct 17, 2019
Speaker:
Alex Ma, Alluxio
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges: network latency impacts performance, copying data via DistCp means maintaining duplicate data, and you may have to make application changes to accommodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security, and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
Voldemort & Hadoop @ LinkedIn, Hadoop User Group Jan 2010 (Bhupesh Bansal)
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to TBs of data.
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme... (VMworld)
VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Improve performance and gain room to grow by easily migrating to a modern Ope... (Principled Technologies)
We deployed this modern environment, then migrated database VMs from legacy servers and saw performance improvements that support consolidation.
Conclusion
If your organization’s transactional databases are running on gear that is several years old, you have much to gain by upgrading to modern servers with new processors and networking components and an OpenShift environment. In our testing, a modern OpenShift environment with a cluster of three Dell PowerEdge R7615 servers with 4th Generation AMD EPYC processors and high-speed 100Gb Broadcom NICs outperformed a legacy environment with MySQL VMs running on a cluster of three Dell PowerEdge R7515 servers with 3rd Generation AMD EPYC processors and 25Gb Broadcom NICs. We also easily migrated a VM from the legacy environment to the modern environment, with only a few steps required to set up and less than ten minutes of hands-on time. The performance advantage of the modern servers would allow a company to reduce the number of servers necessary to perform a given amount of database work, thus lowering operational expenditures such as power and cooling and IT staff time for maintenance. The high-speed 100Gb Broadcom NICs in this solution also give companies better network performance and networking capacity to grow as they embrace emerging technologies such as AI that put great demands on networks.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio (Alluxio, Inc.)
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data in separate storage, such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop's scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges of scale, security, and multi-tenancy that we have dealt with over several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo's Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
VMworld 2013
Chris Greer, FedEx
Richard McDougall, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
6. Exploring new opportunities in Big Data-as-a-Service (BDaaS)
o Researching possible BDaaS solutions
o Making BDaaS work better in IT infrastructure
o Moving the future of BDaaS forward
Focusing on Sahara in OpenStack
o Bringing CDH into Sahara
o Creating more features in Sahara
o Ranked #1 in LOC, #3 in commits for Sahara contributions
ABOUT OUR TEAM
8. o You or someone at the company is using public Big Data application services like AWS EMR.
You need Sahara to migrate Big Data applications to your private cloud.
o You have multiple Hadoop clusters in your environment and would like to integrate them for better infrastructure utilization.
You need Sahara to virtualize Hadoop into the cloud infrastructure.
o You have been using OpenStack as an IT cloud infrastructure for many years, and a Hadoop cluster is also running in your IT environment.
You need Sahara to bring them together as a unified IT environment for easier maintenance.
FROM THE CUSTOMER NEEDS
Source: OpenStack Vancouver Design Summit, Benchmarking a Sahara-based as-a-Service Solution, by Red Hat & Intel
9. Data Scientists/Analysts
o Provide an elastic way to run big data applications
Developers
o Bring up a custom big data infrastructure for different needs
Administrators/Operators
o A better way to maintain not only the hardware platform but also the software packages
Company
o Cost, cost, cost
BETTER USER EXPERIENCE MEANS…
10. A COMPLEX BIG DATA SOLUTION
[Diagram: structured and unstructured data from different types of data sources flow into the big data solution; complexity in organizing data (ETL); diverse BI reports; ecosystem components such as Pig and ZooKeeper.]
13. SAHARA DATA PROCESSING PATTERN
Pattern 1: Internal HDFS
[Diagram: a collect application feeds data into an OpenStack instance running the Node Manager and Data Node.]
OpenStack supports creating HDFS on Cinder or on the ephemeral disk. This method can provide better data processing performance via the ephemeral disk, or persist the data via Cinder at lower performance.
Pros: Performance can be extremely fast (depending on the storage backend).
Cons: Data persistence may be a problem, since the data follows the life of the virtual cluster.
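The ephemeral-vs-Cinder trade-off in Pattern 1 is usually decided in the node group template the cluster is built from. The sketch below is a hypothetical example in Python: `volumes_per_node` and `volumes_size` are fields from the Sahara node group template API, while the plugin, version, flavor, and sizes are placeholder assumptions.

```python
# Sketch of a Sahara worker node group template choosing between the two
# storage options above: ephemeral disk (fast, volatile) or Cinder
# volumes (persistent, slower). Values are placeholders, not a tested config.

def worker_template(persist_on_cinder):
    tmpl = {
        "name": "worker",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "flavor_id": "<flavor-id>",
        "node_processes": ["nodemanager", "datanode"],
    }
    if persist_on_cinder:
        # HDFS blocks land on Cinder volumes and outlive the instance.
        tmpl["volumes_per_node"] = 2
        tmpl["volumes_size"] = 100   # GB per volume
    # Otherwise HDFS lives on the ephemeral disk: faster, but the data
    # follows the life of the virtual cluster.
    return tmpl

print(sorted(worker_template(True)))
```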
14. SAHARA DATA PROCESSING PATTERN
Pattern 2: External HDFS
[Diagram: a collect application feeds data into Instance 1, which runs the Node Manager; Instance 2 runs the Data Node.]
You can also choose to deploy HDFS to two different instances. This gives you more elasticity in managing your instances: you can save compute power by turning off your Node Manager instance.
Pros: Performance may be the same as Pattern 1, but it is more flexible: you can control your instances, save power, and still persist your data in the data node.
Cons: A long-running cluster may still need to consider another way of persisting data.
15. SAHARA DATA PROCESSING PATTERN
Pattern 3: Swift
[Diagram: a collect application feeds data into an OpenStack instance running the Node Manager; Swift streams data to the instance.]
Using Swift, data can be streamed from storage to Hadoop directly. It provides a way to store your data externally and solves the data persistence problem. Swift currently also supports a data locality feature.
Pros: Streams data directly and integrates with your Swift infrastructure.
Cons: Performance can be an issue compared with the other patterns, which use HDFS.
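In Pattern 3, Hadoop reaches Swift through the hadoop-openstack filesystem, configured via `fs.swift.service.*` properties in core-site.xml. The sketch below generates those properties in Python; the property names follow the hadoop-openstack/Sahara Swift integration, while the "sahara" provider name and all values are placeholder assumptions.

```python
# Generate the core-site.xml properties the Hadoop Swift filesystem
# (hadoop-openstack) needs to reach a Swift data source. Property names
# follow the hadoop-openstack docs; every value here is a placeholder.

def swift_core_site(auth_url, tenant, user, password, provider="sahara"):
    prefix = "fs.swift.service." + provider
    return {
        prefix + ".auth.url": auth_url,   # Keystone auth endpoint
        prefix + ".tenant": tenant,
        prefix + ".username": user,
        prefix + ".password": password,
    }

conf = swift_core_site("http://keystone:5000/v2.0/tokens",
                       "demo", "hadoop", "<secret>")
# Jobs then address input as e.g. swift://my-container.sahara/input/part-0
print(len(conf))  # 4
```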
16. Cluster Deployment
o Service deployment
Compute Engine Choice
o Bare metal, KVM, Docker, Hyper-V, vSphere, Xen
Storage Architecture
o Ephemeral disk
o Persistent volume
o Performance
o Cost
o Current IT infrastructure
Deployment Consideration
[Diagram: a host runs instances (Node Managers) on a compute engine (bare metal, KVM, or container); data resides on ephemeral disk, block storage, or object storage, spanning the compute engine, storage infrastructure, and cluster deployment layers.]
18. Issue 1 - Provisioning a Cluster Takes a Long Time
Problem Description:
o 10,000+ jobs per day across several different workloads (some jobs run in SECONDS and some run in HOURS)
o Hard to determine whether a job is small or large; it is not only about data size but also about the job's logic
o Provisioning a cluster takes longer than running a small job that finishes in seconds; for example, launching a 4-node cluster takes 10+ minutes
Customer's Feedback:
o Finish jobs on time, with no need to worry about provisioning a cluster
Possible Solutions/Alternatives:
o Run jobs in an existing cluster (depends on the case)
o Run jobs in a public cluster using resource ACLs (to be supported in Liberty)
o Reduce the time to provision a cluster -> plugin-specific
o Using Docker can save time launching an instance, but launching services still takes time
20. Docker also has an advantage when the instance is idle
[Charts: compute node CPU usage (percent) over the full test duration. Docker averages: usr 0.54, sys 0.17. KVM averages: usr 7.64, sys 1.4.]
Source from IBM: Boden Russell (Performance Characteristics of Traditional VMs vs Docker Containers)
21. Issue 2 - Complex Data Processing
Problem Description:
o A job usually runs multiple sub-jobs in a row (e.g., Job A -> Job B -> Job C), and job
scheduling also needs to be supported
Customer’s Feedback:
o Run a complex job to fulfill their use case
o Schedule a job using Sahara EDP
o Run a recurring job
Possible Solutions/Alternatives:
o Currently Sahara EDP only supports running a simple job
o Scheduling a job -> BP: https://review.openstack.org/#/c/175719/
o Running a complex job -> under discussion
o Running a recurring job -> under discussion
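The Job A -> Job B -> Job C pattern the slide describes can be sketched as a tiny pipeline runner. The stage functions below are hypothetical stand-ins for EDP job executions; Sahara itself does not yet provide this chaining:

```python
# Minimal sketch of chaining sub-jobs, the pattern Sahara EDP does not
# support natively. Each "job" is a stand-in callable that consumes an
# input path and returns its output path.

from typing import Callable, List

def run_pipeline(jobs: List[Callable[[str], str]], initial_input: str) -> str:
    """Run jobs in order, feeding each job's output path to the next.
    A failing stage raises, aborting the rest of the chain."""
    data = initial_input
    for job in jobs:
        data = job(data)
    return data

# Illustrative stages (names invented for the example).
job_a = lambda path: path + "/a-out"
job_b = lambda path: path + "/b-out"
job_c = lambda path: path + "/c-out"

result = run_pipeline([job_a, job_b, job_c], "swift://container/input")
print(result)  # swift://container/input/a-out/b-out/c-out
```

A recurring job would wrap the same call in a scheduler loop, which is exactly the part listed above as still under discussion.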
22. Issue 3 - Storage Architecture
Problem Description:
o Currently our customers use a separate Compute Cluster (using Nova) and Storage
Cluster (using Swift as an object store for data). But there is a performance issue when
compute and data sit on different nodes: transferring data must pass through the network.
Customer’s Expectation:
o Find a better solution that fulfills their requirements and integrates with their current storage
architecture
Possible Solutions/Alternatives:
o Use internal HDFS -> needs a way to copy data from Swift to internal HDFS
o Use the Swift data locality feature -> must change their storage architecture
23. Two Phases of Disk Writes During a Sort Run
o Shuffle Map-Reduce data -> intermediate data stored in the temp folder (40% of total throughput)
o Write output -> HDFS write (60% of total throughput)
Sort Workload Profile
[Chart: disk I/O peaks while shuffling data through the temp folder and again while writing output to HDFS/external storage]
24. 1. Hadoop temp folder location
2. HDFS location
3. Data persistence
4. Integration with the current storage architecture (clouds usually use shared
storage)
5. Optimizing storage for your workload
Storage Consideration
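As a sketch of items 1 and 2 above, the standard Hadoop 2 properties below place the shuffle temp folder and the HDFS data directories on different disks, so intermediate-data I/O does not compete with HDFS writes during the Sort phases described earlier. The mount paths are illustrative assumptions; only the property names are real Hadoop configuration keys:

```xml
<!-- mapred-site.xml: put intermediate (shuffle) data on its own disk,
     e.g. fast ephemeral storage (path is an example) -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/mnt/ephemeral/mapred/local</value>
</property>

<!-- hdfs-site.xml: keep HDFS blocks on a separate (e.g. persistent)
     volume (path is an example) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/volume/hdfs/data</value>
</property>
```

Both properties accept comma-separated lists, so multiple disks can be striped across for either role.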
25. Redundancy Issue when Running HDFS over Ceph/GlusterFS
[Diagram: a compute cluster of instances, each running HDFS on Cinder volumes backed by a Ceph cluster; every block is replicated 3 times by HDFS, and Ceph stores each of those replicas 3 more times.]
3 (in HDFS) x 3 (in Ceph) = 9 replicas in the Ceph cluster
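One common mitigation, assuming durability is delegated entirely to the Ceph layer (verify this fits your failure model, since HDFS then has no copy to recover from a lost volume), is lowering the HDFS replication factor. This is a sketch, not a Sahara default:

```xml
<!-- hdfs-site.xml: let the Ceph-backed Cinder volumes handle durability
     and keep a single HDFS copy of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<!-- 1 (in HDFS) x 3 (in Ceph) = 3 physical copies instead of 9 -->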
26. Cinder Volume Instance Locality Support in Sahara
[Diagram: two Nova compute hosts, each running a local cinder-volume service; the HDFS instances on each host (Instance1-3 on Compute1, Instance4-6 on Compute2) attach volumes (Volume1-6) served from the same host.]
27. Swift Performance Issue
Performance impact from:
o Swift overhead comes from the “rename” method in Hadoop
o The “List Endpoint” feature brings a huge impact
o Larger data sizes may widen the performance gap
[Diagram: compared with a 1x baseline of HDFS on Nova instance store, running against Swift shows 1.25x and 1.67x overhead.]
28. The output of the reduce function is written to a temporary location in HDFS.
After completion, the output is automatically renamed from its temporary
location to its final location.
Rename in Reduce Task
ANALYSIS
• Object storage cannot support rename; swiftfs uses “copy and delete” to
implement rename.
• HDFS rename -> changes METADATA in the NameNode
• Swift rename -> copies a new object and deletes the older one in Swift
[Chart: rename cost for local-to-swift, swift-to-swift, and local-to-hdfs; the Swift paths show roughly 1.5x overhead.]
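The cost difference between the two rename paths can be illustrated with a toy model. This is not real HDFS or Swift code; it only mirrors the analysis above, where HDFS touches metadata while a Swift-style store must re-write the object's bytes:

```python
# Toy sketch contrasting the two rename paths described above.

class MetadataStore:
    """HDFS-like: rename only updates the name -> data mapping."""
    def __init__(self):
        self.files = {}
    def rename(self, src, dst):
        self.files[dst] = self.files.pop(src)   # O(1), no data moved

class CopyDeleteStore:
    """Swift-like: rename = copy the full object, then delete the old one."""
    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0
    def rename(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = bytes(data)         # full data copy
        self.bytes_copied += len(data)
        del self.objects[src]                   # delete the original

store = CopyDeleteStore()
store.objects["tmp/part-00000"] = b"x" * 1024
store.rename("tmp/part-00000", "out/part-00000")
print(store.bytes_copied)  # 1024 bytes re-written just to rename one object
```

Scaled to the multi-gigabyte reduce outputs of a real job, this copy is exactly where the ~1.5x Swift overhead comes from.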
29. Issue 4 - Scaling a Cluster
Problem Description:
o Currently there are several issues when scaling a cluster; customers would like to
ask the community to improve their experience
Customer’s Expectation:
o Rebalance HDFS after scaling
o Auto-scale a cluster on demand (e.g., by job size, etc.)
Possible Solutions/Alternatives:
o Rebalance HDFS -> BP: https://blueprints.launchpad.net/sahara/+spec/hdfs-rebalance
o Auto-scaling -> needs to be discussed
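The requested auto-scaling behaviour could start from something as simple as sizing the worker count from the pending job's input. Everything here is invented for illustration (Sahara has no such policy today, as noted above); the per-worker capacity and bounds would be deployment-specific:

```python
# Hedged sketch of job-size-driven auto-scaling. Thresholds and the
# function name are assumptions, not Sahara behaviour.
import math

def desired_workers(pending_input_gb: float,
                    gb_per_worker: float = 64.0,
                    min_workers: int = 2,
                    max_workers: int = 20) -> int:
    """Size the cluster so each worker handles ~gb_per_worker of input,
    clamped to [min_workers, max_workers]."""
    need = math.ceil(pending_input_gb / gb_per_worker)
    return max(min_workers, min(max_workers, need))

print(desired_workers(10))    # tiny job -> floor of 2 workers
print(desired_workers(1000))  # 1 TB of input -> 16 workers
print(desired_workers(5000))  # large job -> capped at 20 workers
```

A real policy would also need the HDFS rebalance step from the blueprint above after each scale event, since new nodes start empty.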
30. Issue 5 - OpenStack Version Support
Problem Description:
o New features usually land in new releases, but customers would like to use new features in old
environments
o Some new features cannot be accepted for backporting to an older release
Customer’s Expectation:
o Customers would like to use new features in Kilo or later versions of OpenStack
Possible Solutions/Alternatives:
o Rolling upgrade from Juno to Kilo
o Use only Sahara and Horizon from Kilo with the other OpenStack projects on Juno -> we haven’t
tried this
o In the future, plugins will support backward compatibility, so a plugin can be decoupled from Sahara
32. o Vanilla plugin supports Hadoop 1.2.1 and Hadoop 2.6
o Spark plugin
o Cloudera CDH plugin
o MapR plugin
o Storm plugin
o New Horizon UI with a guide panel
o Default template support
What’s New in Kilo
33. o Sahara EDP is the focus for processing data flows
o Support more data sources and storage architectures
o Support more Big Data projects
o Integrate with other OpenStack projects:
o Bare metal -> Ironic
o Docker -> Magnum
o Application catalog -> Murano
The Future of Sahara