In this talk, we will give a technical deep dive into the new YARN shared cache feature (YARN-1492) and explore the benefits we are currently seeing on our production clusters at Twitter. The YARN shared cache aims to reduce the considerable amount of network bandwidth and storage spent on resource localization in YARN. Some of this is mitigated by the NodeManager localization service, but that requires YARN application writers to coordinate resource sharing manually. As YARN deployments continue to grow and the types of YARN applications become more diverse, this coordination becomes increasingly difficult. The YARN shared cache allows applications to securely share resources uploaded by a different instance of the same application, a different application altogether, or even a different user. We will give an overview of the shared cache architecture, cover the various components that make up the feature, and give a solid foundation for developers who want to build YARN applications that leverage the shared cache. Additionally, we will use the MapReduce application as a case study for a discussion of best practices and the benefits we have seen on our production clusters at Twitter.
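For readers who want to experiment, turning the shared cache on is largely a configuration exercise. A minimal sketch, assuming a Hadoop release that ships YARN-1492 (the exact property names vary by version, so treat these keys as illustrative and verify them against your release's documentation):

```xml
<!-- yarn-site.xml: enable the shared cache manager (illustrative keys) -->
<property>
  <name>yarn.sharedcache.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.sharedcache.root-dir</name>
  <value>/sharedcache</value>
</property>

<!-- mapred-site.xml: let MapReduce jobs resolve their resources
     (job jar, libjars, files, archives) through the shared cache -->
<property>
  <name>mapreduce.job.sharedcache.mode</name>
  <value>enabled</value>
</property>
```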
A Secure Public Cache For YARN Application Resources (ctrezzo)
A talk given at the SF Hadoop User Group on May 3, 2016. The YARN shared cache is a feature that aims to reduce the overhead of resource localization without requiring application developers to coordinate manually. Leveraging the shared cache, applications can securely share resources localized by a different application or even a different user. In this talk, we will give an overview of the shared cache architecture and a solid foundation for developers who want to build YARN applications that leverage this feature. Additionally, we will discuss some of the benefits we have seen from using the shared cache at Twitter.
HadoopCon2015: Multi-Cluster Live Synchronization with Kerberos Federated Hadoop (Yafang Chang)
In an enterprise on-premises data center, we may have multiple secured Hadoop clusters for different purposes. Sometimes these clusters have different Hadoop distributions or Hadoop versions, or are even located in different data centers. To fulfill business requirements, data synchronization between these clusters can be an important mechanism. However, the story is more complicated in a real-world secured multi-cluster environment than a distcp between two same-version, non-secured Hadoop clusters.
We will go through our experience enabling live data synchronization across multiple Kerberos-enabled Hadoop clusters, including functionality verification, multi-cluster configuration, and the automated setup process. After that, we will share use cases among those Kerberos-federated Hadoop clusters. Finally, we will present our common practices for multi-cluster data synchronization.
Big Data in Containers: Hadoop and Spark in Docker and Mesos (Heiko Loewe)
Three examples of containerized big data analytics:
1. Installation with Docker and Weave for small and medium deployments
2. Hadoop on Mesos with Apache Myriad
3. Spark on Mesos
Self-Organisation as a Cloud Resource Management Strategy (CloudLightning)
Cloud resource management is becoming increasingly challenging with the advent of hyperscale computing and the proliferation of heterogeneous hardware. Meanwhile, resource utilisation remains low, resulting in high energy consumption per executed instruction. This presentation by Prof. John Morrison suggests a self-organised approach to resource management in an attempt to address these challenges.
This presentation was given at the CloudLightning Conference held in conjunction with NC4 2017 in Dublin City University on 11th April 2017.
Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale.
As presented to Portland Big Data User Group on July 23rd 2014.
http://www.meetup.com/Hadoop-Portland/events/194930422/
Improving HDFS Availability with Hadoop RPC Quality of Service (Ming Ma)
Heavy users monopolizing cluster resources are a frequent cause of slowdown for others. With only one namenode and thousands of datanodes, any poorly written application is a potential distributed denial-of-service attack on the namenode. In this talk, you will learn how to prevent slowdowns from heavy users and poorly written applications by enabling IPC Quality of Service (QoS), a new feature in Hadoop 2.6+. On Twitter's and eBay's production clusters, we've seen response times of 500 milliseconds with QoS off drop to 10 milliseconds with QoS on during heavy usage. We'll cover how IPC QoS works and share our experience on how to tune its performance.
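As a concrete starting point, IPC QoS is enabled per RPC port by swapping the call queue implementation. A minimal core-site.xml sketch, assuming the NameNode listens on port 8020 (the property prefix embeds the port number; scheduler tuning keys differ across Hadoop releases, so check your version's docs):

```xml
<!-- Replace the default FIFO call queue on port 8020 with the fair queue -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<!-- DecayRpcScheduler deprioritizes users by their recent request volume -->
<property>
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>
```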
A brief introduction to YARN: how and why it came into existence and how it fits together with this thing called Hadoop.
Focus given to architecture, availability, resource management and scheduling, migration from MR1 to MR2, job history and logging, interfaces, and applications.
Lessons Learned Running Hadoop and Spark in Docker Containers (BlueData, Inc.)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. This session at Strata + Hadoop World in New York City (September 2016) explores various solutions and tips to address the challenges encountered while deploying multi-node Hadoop and Spark production workloads using Docker containers.
Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is "all in" on Docker containers, with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.
This session by Thomas Phelan, co-founder and chief architect at BlueData, discusses how to securely network Docker containers across multiple hosts and discusses ways to achieve high availability across distributed big data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor, so Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker as well as lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52042
NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmospheric Model (inside-BigData.com)
In this video from the HPC User Forum at Argonne, Daniel Arevalo from DeVine Consulting presents: NAVGEM on the Cloud: Computational Evaluation of Cloud HPC with a Global Atmospheric Model.
Prompted by DoD priorities for modernization, cost savings, and redundancy, this project compared the performance of NAVGEM on an in-house Cray system against the following cloud offerings:
* AWS C4.8xlarge
* Penguin B30 queue
* Azure H16r
* AWS c5n.18xlarge
"The Navy Global Environmental Model (NAVGEM) is a global numerical weather prediction computer simulation run by the United States Navy's Fleet Numerical Meteorology and Oceanography Center. This mathematical model is run four times a day and produces weather forecasts. Along with the NWS's Global Forecast System, which runs out to 16 days, the ECMWF's Integrated Forecast System (IFS) and the CMC's Global Environmental Multiscale Model (GEM), both of which run out 10 days, and the UK Met Office's Unified Model, which runs out to 7 days, it is one of five synoptic scale medium-range models in general use."
Watch the video: https://wp.me/p3RLHQ-kOy
Learn more: http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CBlocks: POSIX-Compliant File Systems for HDFS (DataWorks Summit)
With YARN running Docker containers, it is possible to run applications inside these containers that are not HDFS-aware. It is hard to customize these applications, since most of them assume a POSIX file system with rewrite capabilities. In this talk, we will dive into how we created a block storage layer, how it is being tested internally, and the storage containers that make it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This talk will also explore the current state of Ozone along with CBlocks, the architecture of storage containers, how replication is handled, scaling to millions of volumes, and I/O performance optimizations.
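The core idea of such a layer, a POSIX-style block device whose fixed-size blocks live in replicated storage containers, can be sketched in a few lines of Python. This is a toy model with invented names, not the Ozone/CBlocks code:

```python
BLOCK_SIZE = 4  # bytes per block; a real block store uses far larger blocks

class ToyBlockVolume:
    """Maps block index -> bytes, simulating a volume backed by containers."""
    def __init__(self, size_blocks):
        self.blocks = {}              # sparse: unwritten blocks read as zeros
        self.size_blocks = size_blocks

    def write(self, offset, data):
        # Split the write across fixed-size blocks and rewrite in place,
        # which is what a block device must support but HDFS files cannot.
        for i, b in enumerate(data):
            idx, off = divmod(offset + i, BLOCK_SIZE)
            block = bytearray(self.blocks.get(idx, bytes(BLOCK_SIZE)))
            block[off] = b
            self.blocks[idx] = bytes(block)

    def read(self, offset, length):
        out = bytearray()
        for i in range(length):
            idx, off = divmod(offset + i, BLOCK_SIZE)
            out.append(self.blocks.get(idx, bytes(BLOCK_SIZE))[off])
        return bytes(out)

vol = ToyBlockVolume(size_blocks=8)
vol.write(3, b"hello")    # the write spans blocks 0 and 1
print(vol.read(3, 5))     # b'hello'
```

In a real deployment each block would map to an object inside a replicated storage container rather than an in-memory dict.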
Apache HBase: Where We've Been and What's Upcoming (huguk)
Jon Hsieh, Software Engineer @ Cloudera and HBase Committer
Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk will be broken up into two parts. First, I'll talk about how, in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay, and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, Hortonworks, Yahoo!, and XiaoMi. Second, I'll talk about the features in the newest release, 0.96.x, and in the upcoming 0.98.x release.
"Wire Encryption In HDFS: Protect Your Data From Others, Not Yourself"
ApacheCon 2019, Las Vegas.
SPEAKERS: Chen Liang, Konstantin Shvachko. LinkedIn
Wire data encryption is a key component of the Hadoop Distributed File System (HDFS). HDFS can enforce different levels of data protection, allowing users to specify one based on their own needs. However, such enforcement comes as an all-or-nothing feature: wire encryption is enforced either for all accesses or for none. Since encryption bears a considerable performance cost, this all-or-nothing condition forces users to choose between 'faster but unencrypted' and 'encrypted but slower' for all clients. In our use case at LinkedIn, we would like to selectively expose fast unencrypted access to fully managed internal clients, which can be trusted, while only exposing encrypted access to clients outside of the trusted circle, which carry higher security risks. That way we minimize performance overhead for trusted internal clients while still securing data from potential outside threats. We re-evaluated the RPC encryption mechanism in HDFS. Our design extends the HDFS NameNode to run on multiple ports; depending on the configuration, connecting to different NameNode ports yields different levels of encryption protection. This protection is then enforced for both NameNode RPC and the subsequent data transfers to/from DataNodes. System administrators then need to set up a simple firewall rule to allow access to the unencrypted port only for internal clients and expose the encrypted port to outside clients. This approach comes with minimal operational and performance overhead. The feature has been introduced to Apache Hadoop under HDFS-13541.
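The shape of that configuration can be sketched as Hadoop properties. The auxiliary-port key below comes from the HDFS-13541 work; the per-port SASL mapping keys are illustrative of its ingress-port-based resolver and should be verified against the feature's documentation before use:

```xml
<!-- hdfs-site.xml: serve a second RPC endpoint on port 8021 -->
<property>
  <name>dfs.namenode.rpc-address.auxiliary-ports</name>
  <value>8021</value>
</property>
<!-- core-site.xml: map quality-of-protection per ingress port; trusted
     internal clients connect to 8020, external clients to 8021 -->
<property>
  <name>ingress.port.sasl.configured.ports</name>
  <value>8021</value>
</property>
<property>
  <name>ingress.port.sasl.prop.8021</name>
  <value>privacy</value>
</property>
```

A firewall rule restricting 8020 to the internal network completes the picture.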
Thirteen years since Hurricane Catherine. This event was a milestone in the meteorology of the southwestern region of the American continent: for the first time in history, a hurricane was recorded in the southern hemisphere, far from the typical regions of origin of these tropical cyclones. In 2004, little was said about climate change in South America, and even less about the increase in the intensity and frequency of severe meteorological phenomena. At a technical education center, the subject was approached as a didactic exercise, and that same year the case study was presented at the II Uruguayan Meeting of Astronomy, held in the city of Piriápolis between 5 and 7 November, where a warning was given about the probable increase in the frequency and intensity of meteorological phenomena. In August 2005, almost simultaneously with Hurricane Katrina in the United States, Uruguay suffered one of the worst storms in its history, caused by an intense mid-latitude cyclone. Since then, nothing has been the same.
How to live so that the world will be a little different (and better!) for your having been here. Unfolding your own unique mission in life. Bringing new possibilities the world might not have: Known. Seen. Heard.
Every IR presents unique challenges. But - when an attacker uses PowerShell, WMI, Kerberos attacks, novel persistence mechanisms, seemingly unlimited C2 infrastructure and half-a-dozen rapidly-evolving malware families across a 100k node network to compromise the environment at a rate of 10 systems per day - the cumulative challenges can become overwhelming. This talk will showcase the obstacles overcome during one of the largest and most advanced breaches Mandiant has ever responded to, the novel investigative techniques employed, and the lessons learned that allowed us to help remediate it.
Details a massive intrusion by Russian APT29 (AKA CozyDuke, Cozy Bear)
The Impact of IoT on Cloud Computing, Big Data & Analytics (Syam Madanapalli)
IoT applications are different from typical enterprise applications, and most companies are co-opting the definition of IoT to fit whatever products or solutions they offer. I put forward my point of view on how IoT impacts traditional cloud computing and big data analytics. The takeaway from this presentation is that IoT requires real-time computing as data moves from the physical world to the cyber world (the cloud) in order to take action at the right spatiotemporal location.
Deep phenotyping to aid identification of coding & non-coding rare disease v... (mhaendel)
Whole-exome sequencing has revolutionized disease research, but many cases remain unsolved because ~100-1000 candidates remain after removing common or non-pathogenic variants. We present Genomiser to prioritize coding and non-coding variants by leveraging phenotype data encoded with the Human Phenotype Ontology and a curated database of non-coding Mendelian variants. Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes.
Buyers no longer use voicemails and emails from strangers to learn about products. This information is online, whenever buyers are interested. This SlideShare presentation show sellers how to connect in a meaningful way by starting conversations around the buyer’s plans, goals and challenges.
This presentation is one class in HubSpot Academy's free sales training course. You can enroll here: http://certification.hubspot.com/inbound-sales-certification
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) (VMware Tanzu)
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
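The MapReduce model the session introduces can be illustrated without a cluster. A minimal pure-Python word count, where map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group:

```python
from collections import defaultdict

def map_phase(line):
    # Emit one (key, value) pair per word, as a Hadoop Mapper would
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key: the work the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values seen for one key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

Hadoop's contribution is running the same three phases in parallel across many machines, with HDFS holding the input splits and YARN scheduling the tasks.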
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado... (Edureka!)
This Edureka tutorial on Hadoop 3 (Hadoop blog series: https://goo.gl/LFesy8) will help you focus on the changes expected in Hadoop 3, which is still in its alpha phase. The Apache community has incorporated many changes into Apache Hadoop 3 and is still working on some of them. We will take a broader look at the expected changes:
1. Support For Erasure Coding In HDFS
2. YARN Timeline Service V.2
3. Shell Script Rewrite
4. Shaded Client Jars
5. Support For Opportunistic Containers
6. Mapreduce Task-level Native Optimization
7. Support For More Than 2 Passive Namenodes
8. Default Ports Of Multiple Services Have Been Changed
9. Intra-DataNode Balancer
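To see why erasure coding (item 1 above) matters, compare its storage overhead with the classic 3x replication. A quick back-of-the-envelope calculation in Python, using the common RS(6,3) Reed-Solomon layout:

```python
def storage_needed(data_tb, data_units, parity_units):
    # Reed-Solomon RS(k, m): every k data blocks carry m parity blocks,
    # so raw storage = logical size * (k + m) / k
    return data_tb * (data_units + parity_units) / data_units

logical = 100                            # TB of user data
replication = logical * 3                # 3x replication: 300 TB raw
erasure = storage_needed(logical, 6, 3)  # RS(6,3): 150 TB raw
print(replication, erasure)  # 300 150.0
```

Both schemes tolerate the loss of any three blocks in a group, but erasure coding halves the raw storage bill, at the cost of extra CPU for encoding and reconstruction.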
The new YARN framework promises to make Hadoop a general-purpose platform for Big Data and enterprise data hub applications. In this talk, you'll learn about writing and taking advantage of applications built on YARN.
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka (Edureka!)
( Hadoop Training: https://www.edureka.co/hadoop )
This Edureka "What is Hadoop" tutorial (Hadoop blog series: https://goo.gl/LFesy8) helps you understand how big data emerged as a problem and how Hadoop solved it. The tutorial discusses Hadoop architecture, HDFS and its architecture, YARN, and MapReduce in detail. Below are the topics covered:
1) 5 V’s of Big Data
2) Problems with Big Data
3) Hadoop-as-a solution
4) What is Hadoop?
5) HDFS
6) YARN
7) MapReduce
8) Hadoop Ecosystem
Combine SAS High-Performance Capabilities with Hadoop YARN (Hortonworks)
What if you could assemble all your data in one system and run your critical analytic applications in parallel, regardless of the format, age or location of the data? Today, thanks to the economics of Apache Hadoop-based data platforms, in particular YARN, this is possible.
SAS applications bring highly advanced, in-memory analytic processing to the data in Hadoop and enable a rich set of additional use cases with high performance analytic needs. Join this webinar and learn how, by combining their solutions, Hortonworks and SAS offer more flexibility to choose best of breed SAS HPA and LASR analytic applications in conjunction with trusted Hadoop workloads. Hear how it is possible to leverage Hadoop clusters to extend the power of SAS analytics.
You will hear directly from our experts how SAS HPA and LASR have been integrated with Hadoop YARN to:
Enable a modern data architecture without the need for fragmented processing clusters for each workload.
Ensure low-latency local data access directly from the data nodes.
Create unified resource management views for managing SAS HPA, LASR and HDP resources.
Speakers:
Arun Murthy, Founder and Architect at Hortonworks
Arun is an Apache Hadoop PMC member and has been a full-time contributor to the project since its inception in 2006. He is also the lead of the MapReduce project and has focused on building next-generation MapReduce (YARN). Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. He jointly holds the current world sorting record using Apache Hadoop.
Paul Kent, Vice President Big Data at SAS
Paul Kent is Vice President of Big Data initiatives at SAS. He spends his time between Customers, Partners and the Research & Development teams discussing, evangelizing and developing software at the confluence of big data and high performance computing. A datacenter rack full of current-generation 64bit x86 processors represents a very large aggregate memory space, thousands of threads and plentiful IO that can be harnessed to solve problems at a much larger scale than we have traditionally been accustomed to.
Dache: a data-aware cache system for big data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework, provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop (Hortonworks)
This deck covers concepts and motivations behind Apache Hadoop YARN, the key technology in Hadoop 2 to deliver a Data Operating System for the enterprise.
YARN - Hadoop Next Generation Compute Platform (Bikas Saha)
The presentation emphasizes the new mental model of YARN as the cluster OS, on which one can write and run different applications in Hadoop in a cooperative multi-tenant cluster.
Similar to Hadoop Summit 2015: A Secure Public Cache For YARN Application Resources
May Marketo Masterclass, London MUG May 22 2024.pdf (Adele Miller)
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
AI Pilot Review: The World's First Virtual Assistant Marketing Suite (Google)
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. It’s here, custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
A Sighting of filterA in Typelevel Rite of Passage
Hadoop Summit 2015: A Secure Public Cache For YARN Application Resources
1. A SECURE PUBLIC CACHE FOR YARN APPLICATION RESOURCES
Chris Trezzo @ctrezzo
Sangjin Lee @sjlee
2. AGENDA
• YARN localization
• Shared cache design overview
• Adding resources to the cache
• Does it work?
• Anti-patterns that cause churn
• Admin tips
• Dev tips
14. YARN LOCAL CACHE
Caches application resources on Node Manager local disk
Maintains a target size via an LRU eviction policy
Three resource visibilities:
• Public - shared by everyone
• User - shared by the same user
• Application - shared by the same application
Resources are identified by their HDFS path
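The LRU eviction behavior above can be sketched with a plain `LinkedHashMap` in access order. This is an illustrative stand-in (capped by entry count rather than bytes), not the Node Manager's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: least-recently-used entries are evicted once
// the cache exceeds its target size.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true -> iteration order is LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once over the target size
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("foo.jar", "/disk1/foo.jar");
        cache.put("bar.jar", "/disk1/bar.jar");
        cache.get("foo.jar");                   // touch foo.jar
        cache.put("baz.jar", "/disk1/baz.jar"); // evicts bar.jar (least recent)
        System.out.println(cache.keySet());     // [foo.jar, baz.jar]
    }
}
```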
18. MAPREDUCE DISTRIBUTED CACHE
Caches MapReduce job resources in HDFS
API for retrieving known paths in HDFS
Sets YARN local cache visibility based on HDFS file permissions
When sharing resources, it does not manage:
• Resource location
• Resource cleanup
• Resource integrity
19. RESOURCE SHARING IN PRACTICE
Little to no resource sharing between applications
• Lack of coordination at the HDFS level
• Majority of applications are MapReduce jobs that upload resources into staging directories (i.e. application-level visibility)
21-23. HOW BIG IS THIS PROBLEM?
[Diagram: YarnClient uploads resources to HDFS; the Resource Manager schedules Container1-Container4 on NM1-NM4, each of which localizes the resources from HDFS]
• ~ 125 MB uploaded per app
• ~ 200 KB/sec localized per node (assuming 1.7 containers launched per node per minute)
24. YARN SHARED CACHE
YARN-1492
In production at Twitter for ~ 1 year
Hundreds of thousands of applications use it per day
• MapReduce jobs (i.e. Scalding/Cascading)
Working towards open source release
• The YARN feature is in 2.7 (beta)
• Full MapReduce support coming soon (MAPREDUCE-5951)
25. DESIGN GOALS
Scalable
• Accommodate a large number of cache entries (thousands of cached resources)
• Place minimal load on the Namenode and Resource Manager
• Handle spikes in cache size gracefully
26. DESIGN GOALS
Secure
• Identify resources in the cache based on their contents, not their storage location
• Trust that if the cache says it has a resource, it actually has the resource
27. DESIGN GOALS
Fault tolerant
• YARN applications continue running without the shared cache
• The shared cache should tolerate restarts
• Either persist or recreate cache resource metadata
28. DESIGN GOALS
Transparent
• From the YARN application developer perspective: management of the cache (adding/deleting resources)
• From the MapReduce job developer perspective: jobs can use the cache with no change
30. SHARED CACHE CLIENT
Interacts with the shared cache manager
Responsible for:
• Computing the checksum of resources
• Claiming application resources
Public API:
Path use(String checksum, String appId);
Return value:
• HDFS path to the resource in the cache
• Null if the resource is not in the shared cache
[Architecture diagram: a YARN application (i.e. MapReduce) talks to the shared cache manager (SCM store + SCM cleaner) through the YARN client's shared cache client; the Node Manager's shared cache uploader writes to the shared cache HDFS directory]
31. HDFS DIRECTORY
Stores shared cache resources
Protected by HDFS permissions
• Globally readable
• Writing restricted to a trusted user
Only modified by the uploader and cleaner
Directory structure:
/sharedcache/a/8/9/a896857d078/foo.jar
/sharedcache/5/0/f/50f11b09f87/bar.jar
/sharedcache/a/6/7/a678cb1aa8f/job.jar
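The directory layout above fans resources out by the first three characters of their checksum, followed by the full checksum and the original file name. A minimal sketch of that mapping (the class and method names here are hypothetical, for illustration only):

```java
// Sketch of the shared cache directory layout: three single-character
// directory levels taken from the checksum prefix, then the checksum
// itself, then the file name.
public class SharedCachePath {
    static String cachePath(String root, String checksum, String fileName) {
        return String.format("%s/%s/%s/%s/%s/%s",
                root,
                checksum.charAt(0),
                checksum.charAt(1),
                checksum.charAt(2),
                checksum,
                fileName);
    }

    public static void main(String[] args) {
        System.out.println(cachePath("/sharedcache", "a896857d078", "foo.jar"));
        // /sharedcache/a/8/9/a896857d078/foo.jar
    }
}
```

The three-level fan-out keeps any single HDFS directory from accumulating too many children as the cache grows.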
32. SHARED CACHE MANAGER
Serves requests from clients
Manages resources in the shared cache
• Metadata
• Persisted resources in HDFS
33. SHARED CACHE MANAGER
Store
• Maintains cache metadata (resources in the shared cache, applications using resources)
• Implementation is pluggable
• Currently uses an in-memory store that recreates state after restart
34. SHARED CACHE MANAGER
Cleaner
• Maintains persisted resources in HDFS
• Runs periodically (default: once a day)
Evicts a resource only if both of the following hold:
• The resource has exceeded the stale period (default: 1 week)
• There are no live applications using the resource
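The cleaner's eviction rule above boils down to a two-part predicate. A sketch of that logic (class, method, and field names are illustrative, not the actual SCM code):

```java
// Illustrative sketch of the cleaner's eviction rule: a resource is evicted
// only if it is stale (not accessed within the stale period) AND no running
// application has claimed it.
public class CleanerPolicy {
    // Default stale period: 1 week, in milliseconds
    static final long STALE_PERIOD_MS = 7L * 24 * 60 * 60 * 1000;

    static boolean shouldEvict(long lastAccessMs, long nowMs, int liveAppCount) {
        boolean stale = nowMs - lastAccessMs > STALE_PERIOD_MS;
        return stale && liveAppCount == 0;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long eightDaysAgo = now - 8L * 24 * 60 * 60 * 1000;
        System.out.println(shouldEvict(eightDaysAgo, now, 0)); // true: stale, unused
        System.out.println(shouldEvict(eightDaysAgo, now, 2)); // false: still in use
        System.out.println(shouldEvict(now - 1000, now, 0));   // false: not stale yet
    }
}
```

Requiring both conditions means a resource in constant use is never evicted, while an abandoned one is reclaimed after the stale period.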
35. SHARED CACHE UPLOADER
Runs on the Node Manager
Adds resources to the shared cache
• Verifies the resource checksum
• Uploads the resource to HDFS
• Notifies the shared cache manager
Asynchronous from container launch
Best-effort: failure does not impact running applications
36-42. ADDING TO THE CACHE
[Sequence diagram: YarnClient, Shared Cache Manager, Resource Manager, NM1-NM4, HDFS shared cache directory]
1. The client computes the checksum of the resource
2. The client notifies the SCM of its intent to use the resource: use(checksum, appId)
3. The SCM responds with the HDFS path to the resource in the cache (e.g. /sharedcache/a/8/9/a896857d078/foo.jar), or null if the resource is not in the cache
4. On a miss, the client uploads the resource to HDFS (existing step)
5. The client sets the resource to be added to the cache (via the LocalResource API) and submits the application
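The steps above can be sketched end to end with an in-memory stand-in for the shared cache manager. All names here are illustrative, not the Hadoop API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy walkthrough of the add-to-cache flow, with a HashMap standing in
// for the SCM's store (checksum -> HDFS path in the shared cache).
public class AddToCacheFlow {
    static Map<String, String> scm = new HashMap<>();

    // Steps 2-3: claim the resource; returns the cached path, or null on a miss
    static String use(String checksum, String appId) {
        return scm.get(checksum);
    }

    public static void main(String[] args) {
        String checksum = "a896857d078";         // step 1: checksum of foo.jar
        String cached = use(checksum, "app_1");  // steps 2-3: ask the SCM
        if (cached == null) {
            // Steps 4-5: on a miss, upload to the staging directory as usual
            // and mark the resource for asynchronous upload into the cache
            // by the Node Manager (staging path is illustrative)
            cached = "hdfs:///user/alice/.staging/app_1/foo.jar";
            System.out.println("cache miss, localizing from " + cached);
        } else {
            System.out.println("cache hit: " + cached);
        }
    }
}
```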
49. DOES IT WORK?
Data transfer from application submission client to HDFS
• 272.2 TB saved on production clusters (since September)
• 276.7 TB saved on ad-hoc clusters (since July)
51. BEFORE SHARED CACHE
Each application uploaded and localized (on average)
• ~ 12 resources
• ~ 125 MB total size
Each container localized (on average)
• ~ 6 resources
• ~ 63 MB total size
52. AFTER SHARED CACHE
Each application uploaded and localized (on average)
• ~ 0.16 resources
• ~ 1.7 MB total size
Each container localized (on average)
• ~ 0.08 resources
• ~ 840 KB total size
That is a ~99% reduction in localization bandwidth!
64. ANTI-PATTERNS THAT CAUSE CHURN
Local cache churn
• Local cache size too small for the working set
• Competing uses of the public cache (e.g. the Pig jar cache, Cascading intermediate files)
65. ADMIN TIPS
Avoid “fat jars”
Make your jars repeatable (byte-for-byte identical builds, so checksums match across runs)
Set the local cache size appropriately
• yarn.nodemanager.localizer.cache.target-size-mb=65536
Increase the public localizer thread pool size
• yarn.nodemanager.localizer.fetch.thread-count=12
Adjust the cleaner frequency to your usage pattern
• yarn.sharedcache.cleaner.period.minutes (default: 1 day)
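The first two settings above would go in yarn-site.xml; a sketch, using the property names and values as listed on the slide:

```xml
<!-- Sketch: admin settings from the tips above, as yarn-site.xml entries -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>65536</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.fetch.thread-count</name>
  <value>12</value>
</property>
```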
66. DEV TIPS
YARN Developer
• Invoke use API to claim resources
• Use the LocalResource API to add resources
MapReduce Developer
• Set mapreduce.job.sharedcache.mode
• jobjar,libjar,files,archives
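The MapReduce setting above can be expressed as a configuration entry; a sketch using the property name and value list exactly as given on the slide:

```xml
<!-- Sketch: opt a MapReduce job's resource types into the shared cache -->
<property>
  <name>mapreduce.job.sharedcache.mode</name>
  <value>jobjar,libjar,files,archives</value>
</property>
```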
67. ACKNOWLEDGEMENTS
Code: Chris Trezzo, Sangjin Lee, Ming Ma
Shepherd: Karthik Kambatla
Design Review
• Karthik Kambatla
• Vinod Kumar Vavilapalli
• Jason Lowe
• Joep Rottinghuis