This document summarizes a workshop on data analytics using big data tools held at Bharathiar University. It discusses the growth of data, limitations of conventional approaches to data analysis, and how the Hadoop framework addresses these issues. The key components of Hadoop including HDFS and MapReduce are explained. HDFS architecture and operations are described in detail.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
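The split/sort/reduce flow described above can be sketched in a few lines of plain Python. This is a toy word count, not the actual Hadoop API; the chunk contents and function names are illustrative.

```python
# Toy word count illustrating the MapReduce flow: map over independent
# input chunks, sort/group the map outputs by key, then reduce each group.
from itertools import groupby

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    # Aggregate all counts emitted for one key.
    return (word, sum(counts))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
# Map: each chunk is processed independently (in parallel, in Hadoop).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
# Shuffle/sort: the framework sorts map outputs by key.
mapped.sort(key=lambda kv: kv[0])
# Reduce: each key's values are aggregated.
result = dict(reduce_phase(k, [v for _, v in g])
              for k, g in groupby(mapped, key=lambda kv: kv[0]))
print(result)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop the map tasks would run on the nodes holding each chunk, and the sorted intermediate pairs would be shuffled over the network to the reduce tasks; here everything runs in one process.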
1. Workshop on Data Analytics Using Big Data Tools 2016 – Bharathiar University
K. Santhiya, Ph.D Research Scholar, and Dr. V. Bhuvaneswari, Asst. Professor, Dept. of Computer Applications, Bharathiar University – WDABT 2016
2. Introduction to Hadoop
Presented by K. Santhiya, Ph.D Research Scholar, Department of Computer Applications, Bharathiar University
Under the guidance of Dr. V. Bhuvaneswari, Assistant Professor, Department of Computer Applications, Bharathiar University
3. Agenda
• World of Data: a few instances
• Conventional Approaches: limitations
• Hadoop Framework: terminology review
• Hadoop Components: HDFS and MapReduce
• HDFS in Detail
• Hadoop Ecosystem
4. Data Explosion
2.5 quintillion bytes of data are created each day.
5. Worldwide Data
[Chart comparing the data created since the beginning of time with the data created in the last two years.]
6. 2.9 375 20 24 50 700 1.3 72
Million MB Hrs PB Million Billion Exabytes items
thE World of data
3
K.Santhiya , Ph.d Research Scholar , Dr.V.Bhuvaneswari, Asst.Professor, Dept. of Comp. Appll., Bharathiar University,-
WDABT 2016
7. The minimum size that a big data file starts with is at least 1 terabyte.
10. Conventional Approaches
• RDBMS
• OS file system
• SQL queries
• Custom frameworks: C/C++, Perl, Python
11. Issues in Legacy Systems
• Limited storage capacity
• Limited processing capacity
• No scalability
• Single point of failure
• Sequential processing
• RDBMSs can handle only structured data
• Preprocessing of data is required
• Information is collected according to current business needs
12. How do we mine (and mind) all this data? How do we resolve all these issues?
13. Mr. Hadoop says he has a solution to our BIG problem!
16. Companies Using Hadoop
[Logo collage of companies using Hadoop.]
17. What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model.
Concept: moving computation is more efficient than moving large data.
24. Hadoop Core Services
i. NameNode
ii. DataNode
iii. ResourceManager
iv. ApplicationMaster
v. NodeManager
vi. Secondary NameNode
25. HDFS – Real-Life Connect
• A college library was gifted a massive collection of very popular books by a patron. The librarian decided to arrange the books in a small rack and to distribute multiple copies of each book across other racks, so that students could find the books easily. Similarly, HDFS creates multiple copies of a data block and keeps them in separate systems for easy access.
26. What is HDFS?
• Hadoop Distributed File System
• A highly fault-tolerant, distributed, reliable, scalable file system for data storage
• Stores multiple copies of data on different nodes
• A file is split into blocks that are stored on multiple machines
• A Hadoop cluster typically has a single NameNode and a number of DataNodes
27. HDFS Blocks
• Files are broken into large blocks, typically 128 MB in size
• Blocks are replicated for reliability
• One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are placed randomly
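The replica-placement rule above can be sketched as a small function. This is a minimal illustration of the policy as stated on the slide (first replica on the writer's node, second on a remote rack, third on the writer's local rack), not the actual HDFS placement code; node and rack names are hypothetical.

```python
# Sketch of the slide's replica-placement policy. `nodes` maps each
# node name to the rack it sits on; names here are made up.
def place_replicas(writer, nodes):
    local_rack = nodes[writer]
    # First replica: the node writing the block.
    replicas = [writer]
    # Second replica: any node on a different (remote) rack.
    replicas.append(next(n for n, r in nodes.items() if r != local_rack))
    # Third replica: another node on the writer's local rack.
    replicas.append(next(n for n, r in nodes.items()
                         if r == local_rack and n != writer))
    return replicas

cluster = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
print(place_replicas("n1", cluster))  # ['n1', 'n3', 'n2']
```

Spreading replicas across racks means a block survives the loss of a whole rack, while keeping one copy on the local node keeps the write fast.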
28. HDFS Blocks (contd.)
Advantages of HDFS blocks:
• Fixed size
• A chunk of a file smaller than the block size uses only the space it needs, e.g., a 420 MB file is split into three 128 MB blocks and one 36 MB block
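The 420 MB example can be worked through with a few lines of Python, assuming the 128 MB default block size mentioned on the previous slide (the function name is illustrative):

```python
# Split a file of the given size (in MB) into fixed-size HDFS-style blocks.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # The last block only occupies as much space as is left over.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(420))  # [128, 128, 128, 36]
```

The final 36 MB block illustrates the advantage stated above: a chunk smaller than the block size consumes only the space it needs.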
35. NameNode in HA Mode
[Diagram.]
36. NameNode HA Architecture
[Diagram.]
37. Business Scenario
Olivia Tyler is the EVP of IT operations at Nutri Worldwide, Inc., and she has decided to use HDFS for storing big data. She will use the HDFS shell to store the data in a Hadoop file system, and she will execute various commands on it.
41. Data Transfer Components
[Diagram.]
42. Data Store Components
• The following are the data store components of the Hadoop ecosystem:
• HBase: a distributed, scalable big data store
• Cassandra: a scalable, consistent, distributed, structured key-value store
• Accumulo: a sorted, distributed key-value data storage and retrieval system
43. Serialization Components
• The serialization components are Avro, Trevni, and Thrift.
• Avro is a data serialization system.
• Trevni is a column file format intended to permit compatible, independent implementations that read and/or write files in this format.
• Thrift is a framework for scalable, cross-language services development.
44. Job Execution Components
• [Diagram of the job execution components.]
46. Conclusion
47. References
• J. Gantz and D. Reinsel, "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east," in Proc. IDC iView, IDC Anal. Future, 2012.
• (2015) [Online]. Available: http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/
• D. Evans and R. Hutley, "The explosion of data," white paper, 2010.
• Seema Acharya and Subhashini Chelleppan, "Big Data and Analytics," Wiley India Pvt Ltd, 2015.
• Dhruba Borthakur, "HDFS Architecture Guide," 2013.
• [Online]. Available: http://hortonworks.com/hadoop/flume/#section_2
• Marko Grobelnik, "Big-Data Tutorial," white paper, 2012.