This presentation discusses a suggested algorithm to improve Hadoop's performance. The algorithm comes from a research paper presented at IEEE.
Suggested Algorithm to improve Hadoop's performance.
1. Research on Scheduling Scheme for Hadoop Clusters
By: Jiong Xie, FanJun Meng, HaiLong Wang, HongFang Pan, JinHong Cheng, Xiao Qin
2. Outlines
• What is Hadoop?
• Hadoop Characteristics
• Hadoop Objectives
• Big Data Challenges
• Hadoop Architecture
• What is the predictive schedule and prefetching mechanism?
• Hadoop Issues
• Hadoop Scheduler
• PSP Scheduler
• Conclusion
3. Goal
• Designing a prefetching mechanism to solve the data-movement problem in MapReduce and to improve performance.
4. What is Hadoop?
• Hadoop is an open source software framework used to handle large amounts of data and to process them on clusters of commodity hardware.
5. Characteristics
• It is a framework of tools
  - Not a particular program, as some people think
• Open source tools
• Distributed under the Apache license
• Linux-based tools
• It works on a distributed model
  - Not one big powerful computer, but numerous low-cost computers
6. Objectives
• Hadoop supports running applications on Big Data.
• Therefore, Hadoop addresses Big Data challenges.
8. Why Do We Need Hadoop?
• A powerful computer can process data only up to the point where the quantity of data becomes larger than the computer's capacity.
• At that point, we need a tool such as Hadoop.
• Hadoop uses a different strategy to deal with data.
9. Hadoop Functionality
• Hadoop breaks the data into smaller pieces and distributes them equally across different nodes to be processed at the same time.
• Similarly, Hadoop divides the computation equally across the nodes.
• The results are combined and then sent back to the application.
10. Hadoop Functionality
[Diagram: the input data is divided equally across the nodes, each node runs its computation, and the partial results are combined and returned to the application.]
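To make this divide-compute-combine flow concrete, here is the classic WordCount job written against Hadoop's MapReduce Java API (essentially the standard Hadoop tutorial example, shown as a sketch rather than as part of the paper):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each node processes its own piece of the input data.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1)
      }
    }
  }

  // Reduce phase: the partial results are combined into the final answer.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would typically be packaged into a jar and run with a command along the lines of: hadoop jar wordcount.jar WordCount /input /output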
11. Architecture
• Hadoop consists of two main components:
  - MapReduce: divides the workload into smaller pieces
  - File System (HDFS): accounts for component failure and keeps a directory of all the tasks
• Other projects provide additional functionality:
  - Pig
  - Hive
  - HBase
  - Flume
  - Mahout
  - Oozie
  - Sqoop
12. Architecture
• Slave computers consist of two components:
  - Task Tracker: processes the given task; it represents the MapReduce component.
  - Data Node: manages the piece of data for the task given to the Task Tracker; it represents HDFS.
13. Architecture
• The master computer consists of four components:
  - Job Tracker: works under the MapReduce component; it breaks the job into smaller tasks and divides them equally among the Task Trackers.
  - Task Tracker: processes the given task.
  - Name Node: responsible for keeping an index of all the tasks.
  - Data Node: manages the piece of data for the task given to the Task Tracker.
15. Fault Tolerance for Data
• Hadoop keeps three copies of each file, and each copy is stored on a different node.
• If any Task Tracker fails, the Job Tracker detects the failure and asks another Task Tracker to take over that job.
• The tables in the Name Node are also backed up on a different computer; this is why the enterprise version of Hadoop keeps two masters: one working master and one backup master.
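The three-copies policy corresponds to HDFS's replication factor, which defaults to 3 and can be set cluster-wide in hdfs-site.xml; a minimal sketch:

<!-- hdfs-site.xml: number of copies HDFS keeps of each block (default is 3) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>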
16. Scalability Cost
• The scalability cost is always linear: if you want to increase the speed, increase the number of computers.
17. Predictive Schedule and Prefetching
• The idea is to implement a predictive schedule and prefetching (PSP) mechanism on top of Hadoop to improve performance.
• Predictive scheduler:
  - A flexible task scheduler that predicts the most appropriate task trackers for the next data.
• Prefetching module:
  - The part responsible for forcing the preload worker threads to start loading data into the node's main memory before the current task finishes. It depends on the estimated finish time.
18. PSP
• Factors that make PSP possible:
  - Underutilization of the CPU
  - The importance of MapReduce performance
  - The storage availability in HDFS
  - The interaction between the nodes
19. Hadoop's Issue
• In the current MapReduce model, all the tasks are managed by the master node, so the computation nodes ask the master node to assign them the next task to be processed.
• The master node tells the computing nodes what the next task is and where it is located.
• Some CPU time is wasted while the computation node communicates with the master node.
20. Hadoop's Issue
• The original Hadoop assigns tasks randomly, and the required data is read from a local or remote disk by the computation node only when it is needed.
• The CPU of a computing node cannot start processing until all the input data is loaded into main memory.
• This affects Hadoop's performance negatively.
21. Prefetching
• It forces the preload worker threads to start loading data from the local disk into the node's main memory before the current task finishes.
• The waiting time is reduced, so the next task can be processed on time.
• This improves the performance of the MapReduce system.
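The slides do not include the implementation, but the preload-worker idea can be sketched as a background thread that pulls the predicted next block into an in-memory cache while the current task is still running (all class and method names below are illustrative assumptions, not the paper's code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical preload worker: reads the next task's input block from the
// local disk into an in-memory cache before the current task finishes.
public class PreloadWorker {
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();

  // Called by the (hypothetical) prediction module shortly before the
  // current task is expected to finish.
  public void prefetch(String blockPath) {
    pool.submit(() -> {
      try {
        Path p = Paths.get(blockPath);
        cache.put(blockPath, Files.readAllBytes(p)); // load the block into main memory
      } catch (IOException e) {
        // On failure the next task simply falls back to reading from disk.
      }
    });
  }

  // The next task asks for its input; returns null if prefetching did not finish.
  public byte[] getIfCached(String blockPath) {
    return cache.get(blockPath);
  }
}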
22. Hadoop Scheduler
• In the original Hadoop scheduler, the Job Tracker includes a task scheduler module that assigns tasks to the different Task Trackers.
• The Task Trackers periodically send a heartbeat to the Job Tracker.
• The Job Tracker checks the heartbeats and sends tasks to the available Task Trackers.
• The scheduler assigns tasks to the nodes at random via the same heartbeat message protocol.
• Because it assigns tasks randomly, it mispredicts stragglers in many cases.
23. Predictive Scheduler
• A predictive scheduler is built by designing a prediction algorithm and integrating it with the original Hadoop.
• The predictive scheduler predicts stragglers and finds the appropriate data blocks.
• The prediction decisions are made by a prediction module during the prefetching stage.
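The slides do not spell out how stragglers and finish times are predicted. A common progress-based estimate, similar in spirit to Hadoop's speculative execution, is sketched below; the formula and the straggler threshold are assumptions for illustration, not taken from the paper:

// Hypothetical progress-based prediction: estimate each running task's
// remaining time from its progress score, then flag tasks that are expected
// to run much longer than a typical task as likely stragglers.
public class FinishTimePredictor {

  // progress is in [0, 1]; elapsedMs is the time since the task attempt started.
  public static double estimatedRemainingMs(double progress, long elapsedMs) {
    if (progress <= 0.0 || elapsedMs <= 0) {
      return Double.POSITIVE_INFINITY;            // no progress yet, cannot estimate
    }
    double progressRate = progress / elapsedMs;   // progress per millisecond
    return (1.0 - progress) / progressRate;       // time left at the current rate
  }

  // Illustrative straggler rule: a task is a straggler if its predicted
  // remaining time is well above the median of its peers.
  public static boolean isStraggler(double remainingMs, double medianRemainingMs) {
    return remainingMs > 1.5 * medianRemainingMs;
  }
}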
25. Launching Process
• Three basic steps to launch a task:
  - Copying the job from the shared file system to the Job Tracker's file system, along with all the required files.
  - Creating a local directory for the task and un-jarring the content of the jar into that directory.
  - Copying the task to the Task Tracker to be processed.
26. Launching Process
• In PSP, all of the above steps are monitored by the prediction module, which predicts three events:
  - The finish time of the currently processed task.
  - The tasks that are going to be assigned to the Task Trackers.
  - The launch time of the pending tasks.
27. Prefetching
• Three issues must be addressed:
  - When to prefetch
  - What to prefetch
  - How much to prefetch
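One way to picture how these three questions could be answered in code; the thresholds, names, and memory cap below are illustrative assumptions, not the paper's actual policy:

// Hypothetical prefetching policy answering "when", "what", and "how much".
public class PrefetchPolicy {
  private final long maxPrefetchBytes;   // "how much": bounded by available memory

  public PrefetchPolicy(long maxPrefetchBytes) {
    this.maxPrefetchBytes = maxPrefetchBytes;
  }

  // "When": start prefetching once the predicted remaining time of the current
  // task drops below the time needed to load the next block from disk.
  public boolean shouldPrefetch(double remainingMs, double estimatedLoadMs) {
    return remainingMs <= estimatedLoadMs;
  }

  // "What" and "how much": prefetch the block predicted for this tracker,
  // but only if it still fits within the memory budget.
  public boolean canPrefetch(long blockSizeBytes, long alreadyPrefetchedBytes) {
    return alreadyPrefetchedBytes + blockSizeBytes <= maxPrefetchBytes;
  }
}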
28. Conclusion
• A predictive scheduling and prefetching mechanism (PSP) is proposed to enhance Hadoop's performance.
• The prediction module predicts the data blocks that will be accessed by the computing nodes in a cluster.
• The prefetching module preloads this future set of data into the cache of the nodes.
• Applied on 10 nodes, PSP reduces the execution time by up to 28%, with an average reduction of 19%.
• It increases the overall throughput and the I/O utilization.