This document provides information about the COMP9313: Big Data Management course, including the lecturer, course aims, schedule, assessment, and resources. The course introduces concepts and technologies for managing large-scale data sets and developing big data analytics solutions. Topics include Apache Hadoop, HDFS, HBase, Hive, Pig, Spark and applications like link analysis and graph processing. Students will complete programming assignments and a final exam. Lectures will focus on frontier big data technologies and applications.
Chapter 1: Introduction
1. COMP9313: Big Data Management
Lecturer: Xin Cao
Course web site: http://www.cse.unsw.edu.au/~cs9313/
2. 1.2
Chapter 1: Course Information and Introduction to Big Data
4. 1.4
Lecturer in Charge
Lecturer: Xin Cao
Office: 201D K17 (outside the lift turn left)
Email: xin.cao@unsw.edu.au
Ext: 55932
Research interests
Spatial Database
Data Mining
Data Management
Big Data Technologies
My publications list at google scholar:
https://scholar.google.com.au/citations?user=kJIkUagAAAAJ&hl=en
5. 1.5
Course Aims
This course aims to introduce you to the concepts behind Big Data,
the core technologies used in managing large-scale data sets, and a
range of technologies for developing solutions to large-scale data
analytics problems.
This course is intended for students who want to understand modern
large-scale data analytics systems. It covers a wide range of topics
and technologies, and will prepare students to be able to build such
systems, as well as to use them efficiently and effectively to address
challenges in big data management.
Not possible to cover every aspect of big data management.
6. 1.6
Lectures
Lectures focus on frontier technologies in big data
management and their typical applications
Try to run in more interactive mode
A few lectures may run in more practical manner (e.g., like a lab/demo)
to cover the applied aspects
Lecture length varies slightly depending on the progress (of that
lecture)
Note: attendance to every lecture is assumed
7. 1.7
Resources
Text Books
Hadoop: The Definitive Guide. Tom White. 4th Edition - O'Reilly
Media
Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman,
Jeff Ullman. 2nd edition - Cambridge University Press
Reference Books and other readings
Advanced Analytics with Spark. Josh Wills, Sandy Ryza, Sean
Owen, and Uri Laserson. O'Reilly Media
Apache MapReduce Tutorial
Apache Spark Quick Start
Many other online tutorials … …
Big Data is a relatively new topic (so no fixed syllabus)
8. 1.8
Prerequisite
Official prerequisite of this course is COMP9024 (Data Structures and
Algorithms) and COMP9311 (Database Systems).
Before commencing this course, you should:
have experiences and good knowledge of algorithm design
(equivalent to COMP9024 )
have a solid background in database systems (equivalent to
COMP9311)
have solid programming skills in Java
be familiar with working on Unix-style operating systems
have basic knowledge of linear algebra (e.g., vector spaces, matrix
multiplication), probability theory and statistics, and graph theory
No previous experience necessary in
MapReduce
Parallel and distributed programming
9. 1.9
Please do not enrol if you
Don’t have COMP9024/9311 knowledge
Cannot produce a correct Java program on your own
Never worked on Unix-style operating systems
Have poor time management
Are too busy to attend lectures/labs
Otherwise, you are likely to perform badly in this subject
10. 1.10
Learning outcomes
After completing this course, you are expected to:
elaborate the important characteristics of Big Data
develop an appropriate storage structure for a Big Data repository
utilize the map/reduce paradigm and the Spark platform to
manipulate Big Data
use a high-level query language to manipulate Big Data
develop efficient solutions for analytical problems involving Big
Data
12. 1.12
Assignments
1 warm-up programming assignment on Hadoop
1 programming assignment on HBase/Hive/Pig
1 warm-up programming assignment on Spark
Another harder assignment on Hadoop
Another harder assignment on Spark
Both results and source code will be checked.
If your code fails to run due to bugs, you will not lose all the
marks.
13. 1.13
Final exam
Final written exam (100 pts)
If you are ill on the day of the exam, do not attend
the exam – I will not accept any medical special
consideration claims from people who already
attempted the exam.
14. 1.14
You May Fail Because …
*Plagiarism*
Code failed to compile due to a mistake of 1 char or 1 word
Late submission
1 sec late = 1 day late
submit wrong files
Program did not follow the spec
I am unlikely to accept the following excuses:
“Too busy”
“It took longer than I thought it would take”
“It was harder than I initially thought”
“My dog ate my homework” and modern variants thereof
15. 1.15
Tentative course schedule
Week Topic Assignment
1 Course info and introduction to big data
2 Hadoop MapReduce 1
3 Hadoop MapReduce 2 Ass1
4 HDFS and Hadoop I/O
5 NoSQL and Hbase Ass2
6 Hive and Pig
7 Spark Ass3
8 Link analysis
9 Graph data processing Ass4
10 Data stream mining Ass5
11 Large-scale machine learning
12 Revision and exam preparation
16. 1.16
Your Feedbacks Are Important
Big data is a new topic, and thus the course is tentative
The technologies keep evolving, and the course materials need to be
updated correspondingly
Please advise where I can improve after each lecture, at the
discussion and QA website
CATEI system
18. 1.18
What is Big Data?
Big data is like teenage sex:
everyone talks about it
nobody really knows how to do it
everyone thinks everyone else is doing it
so everyone claims they are doing it...
--Dan Ariely, Professor at Duke University
19. 1.19
What is Big Data?
No standard definition! here is from Wikipedia:
Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, querying, updating and
information privacy.
Analysis of data sets can find new correlations to "spot business
trends, prevent diseases, combat crime and so on."
20. 1.20
Who is generating Big Data?
Homeland Security
Real Time Search
Social
eCommerce
User Tracking &
Engagement
Financial Services
22. 1.22
Volume (Scale)
Data Volume
Growth 40% per year
From 8 zettabytes (2016) to 44 ZB (2020)
Data volume is increasing exponentially
23. 1.23
How much data?
Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
Processes 20 PB a day (2008)
Crawls 20B web pages a day (2012)
Search index is 100+ PB (5/2014)
Bigtable serves 2+ EB, 600M QPS (5/2014)
300 PB data in Hive + 600 TB/day (4/2014)
400B pages, 10+ PB (2/2014)
LHC: ~15 PB a year
LSST: 6-10 PB a year (~2020)
150 PB on 50k+ servers running 15k apps (6/2011)
S3: 2T objects, 1.1M requests/second (4/2013)
SKA: 0.3 – 1.5 EB per year (~2020)
Hadoop: 365 PB, 330K nodes (6/2014)
“640K ought to be enough for anybody.”
24. 1.24
Variety (Complexity)
Different Types:
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once
A single application can be generating/collecting many types of
data
Different Sources :
Movie reviews from IMDB and Rotten Tomatoes
Product reviews from different provider websites
To extract knowledge, all these types of data need to be linked
together
25. 1.25
A Single View to the Customer
Customer
Social Media
Gaming
Entertainment
Banking / Finance
Our Known History
Purchase
26. 1.26
A Global View of Linked Big Data
Example entities: patient, doctors, gene, protein, drug, “Ebola”,
mutation, diagnosis, prescription, target, tissue
Diversified social network / Heterogeneous information network
27. 1.27
Velocity (Speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions ⇒ missing opportunities
Examples
E-Promotions: Based on your current location, your purchase
history, and what you like ⇒ send promotions right now for the
store next to you
Healthcare monitoring: sensors monitoring your activities and
body ⇒ any abnormal measurements require immediate reaction
Disaster management and response
29. 1.29
Extended Big Data Characteristics: 6V
Volume: In a big data environment, the amounts of data collected and
processed are much larger than those stored in typical relational
databases.
Variety: Big data consists of a rich variety of data types.
Velocity: Big data arrives to the organization at high speeds and from
multiple sources simultaneously.
Veracity: Data quality issues are particularly challenging in a big data
context.
Visibility/Visualization: After big data being processed, we need a way
of presenting the data in a manner that’s readable and accessible.
Value: Ultimately, big data is meaningless if it does not provide value
toward some meaningful goal.
30. 1.30
Veracity (Quality & Trust)
Data = quantity + quality
When we talk about big data, we typically mean its quantity:
What capacity must a system provide to cope with the sheer size of
the data?
Is a query feasible on big data within our available resources?
How can we make our queries tractable on big data?
. . .
Can we trust the answers to our queries?
Dirty data routinely leads to misleading financial reports and
strategic business planning decisions ⇒ loss of revenue, credibility
and customers, with disastrous consequences
The study of data quality is as important as data quantity
31. 1.31
Data in real-life is often dirty
500,000 dead people retain active Medicare cards
81 million National Insurance numbers but only 60 million eligible
citizens
98,000 deaths each year, caused by errors in medical data
34. 1.34
Big Data: 6V in Summary
Transforming Energy and Utilities through Big Data & Analytics. By Anders Quitzau@IBM
35. 1.35
Other V’s
Variability
Variability refers to data whose meaning is constantly changing. This is
particularly the case when gathering data relies on language processing.
Viscosity
This term is sometimes used to describe the latency or lag time in the data
relative to the event being described. We found that this is just as easily
understood as an element of Velocity.
Virality
Defined by some users as the rate at which the data spreads; how often it
is picked up and repeated by other users or events.
Volatility
Big data volatility refers to how long data is valid and how long it
should be stored. You need to determine at what point data is no
longer relevant to the current analysis.
More V’s in the future …
37. 1.37
Cloud Computing
The buzz word before “Big Data”
Larry Ellison’s response in 2009
Cloud Computing is a general term used to describe a new class of
network based computing that takes place over the Internet
A collection/group of integrated and networked hardware, software
and Internet infrastructure (called a platform).
Using the Internet for communication and transport provides
hardware, software and networking services to clients
These platforms hide the complexity and details of the underlying
infrastructure from users and applications by providing very simple
graphical interface or API
A technical point of view
Internet-based computing (i.e., computers attached to network)
A business-model point of view
Pay-as-you-go (i.e., rental)
40. 1.40
Cloud Computing Services
Infrastructure as a service (IaaS)
Offering hardware related services using the principles of cloud
computing. These could include storage services (database or
disk storage) or virtual servers.
Amazon EC2, Amazon S3
Platform as a Service (PaaS)
Offering a development platform on the cloud.
Google App Engine, Microsoft’s Azure
Software as a service (SaaS)
Including a complete software offering on the cloud. Users can
access a software application hosted by the cloud vendor on pay-
per-use basis. This is a well-established sector.
Google’s Gmail, Microsoft’s Hotmail, Google Docs
41. 1.41
Cloud Services
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
Examples: Google App Engine, SalesForce CRM, LotusLive
42. 1.42
Why Study Big Data Technologies?
The hottest topic in both research and industry
Highly demanded in real world
A promising future career
Research and development of big data systems:
distributed systems (e.g., Hadoop), visualization tools, data
warehouse, OLAP, data integration, data quality control, …
Big data applications:
social marketing, healthcare, …
Data analysis: to get values out of big data
discovering and applying patterns, predictive analysis, business
intelligence, privacy and security, …
44. 1.44
What will the course cover
Topic 1. Big data management tools
Apache Hadoop
MapReduce
HDFS
HBase
Hive and Pig
Mahout
Spark
Topic 2. Big data typical applications
Link analysis
Graph data processing
Data stream mining
Some machine learning topics
45. 1.45
Philosophy to Scale for Big Data Processing
Divide Work
Combine Results
46. 1.46
Distributed processing is non-trivial
How to assign tasks to different workers in an efficient way?
What happens if tasks fail?
How do workers exchange results?
How to synchronize distributed tasks allocated to different workers?
47. 1.47
Big data storage is challenging
Data Volumes are massive
Reliability of Storing PBs of data is challenging
All kinds of failures: Disk/Hardware/Network Failures
Probability of failures simply increases with the number of machines …
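The point above follows from basic probability; a quick back-of-the-envelope sketch (the per-machine failure rates used here are illustrative assumptions, not figures from the course):

```python
# Probability that at least one machine in a cluster fails, assuming
# independent failures with per-machine failure probability p over
# some fixed period. (p values below are hypothetical.)
def prob_any_failure(n_machines, p):
    return 1.0 - (1.0 - p) ** n_machines

# With a hypothetical 1% per-machine failure rate, a 10-node cluster
# rarely sees a failure, but a 1000-node cluster almost always does.
print(prob_any_failure(10, 0.01))
print(prob_any_failure(1000, 0.01))
```

This is why fault tolerance has to be designed in, not bolted on: at cluster scale, failure is the normal case.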
48. 1.48
What is Hadoop
Open-source data storage and processing platform
Before the advent of Hadoop, storage and processing of big data was
a big challenge
Massively scalable, automatically parallelizable
Based on work from Google
Google: GFS + MapReduce + BigTable (Not open)
Hadoop: HDFS + Hadoop MapReduce + HBase (open source)
Named by Doug Cutting in 2006 (worked at Yahoo! at that time), after
his son's toy elephant.
49. 1.49
Hadoop offers
Redundant, Fault-tolerant data storage
Parallel computation framework
Job coordination
With Hadoop, programmers no longer need to worry about:
Q: Where is the file located?
Q: How to handle failures & data loss?
Q: How to divide computation?
Q: How to program for scaling?
50. 1.50
Why Use Hadoop?
Cheaper
Scales to Petabytes or more easily
Faster
Parallel data processing
Better
Suited for particular types of big data problems
53. 1.53
Hadoop is a set of Apache Frameworks and more…
Data storage (HDFS)
Runs on commodity hardware (usually Linux)
Horizontally scalable
Processing (MapReduce)
Parallelized (scalable) processing
Fault tolerant
Other Tools / Frameworks
Data Access: HBase, Hive, Pig, Mahout
Tools: Hue, Sqoop
Monitoring: Greenplum, Cloudera
Layers: Hadoop Core (HDFS), MapReduce API, Data Access,
Tools & Libraries, Monitoring & Alerting
54. 1.54
What are the core parts of a Hadoop distribution?
HDFS Storage
Redundant (3 copies)
For large files – large blocks: 64 or 128 MB / block
Can scale to 1000s of nodes
MapReduce API
Batch (job) processing
Distributed and localized to clusters (Map)
Auto-parallelizable for huge amounts of data
Fault-tolerant (auto retries)
Adds high availability and more
Other Libraries: Pig, Hive, HBase, and others
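The block size and replication numbers above translate into simple storage arithmetic; a sketch, assuming 128 MB blocks and 3x replication as on the slide (the 1 GB file size is a hypothetical example):

```python
import math

BLOCK_SIZE_MB = 128   # typical HDFS block size (64 MB in older versions)
REPLICATION = 3       # default HDFS replication factor

def hdfs_blocks(file_size_mb):
    """Return (logical blocks, total block replicas) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

# A 1 GB (1024 MB) file: 8 logical blocks, stored as 24 replicas
# spread across the cluster's DataNodes.
print(hdfs_blocks(1024))  # (8, 24)
```

Large blocks keep the per-file metadata on the NameNode small; replication is what makes the "redundant, fault-tolerant storage" claim concrete.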
55. 1.55
Hadoop 2.0
Hadoop YARN (Yet Another Resource Negotiator): a resource-
management platform responsible for managing computing resources in
clusters and using them for scheduling of users' applications
From a single-use system (batch apps) to a multi-purpose platform
(batch, interactive, online, streaming)
56. 1.56
Hadoop Ecosystem
A combination of technologies which have proficient advantage in solving business problems.
http://www.edupristine.com/blog/hadoop-ecosystem-and-components
57. 1.57
Common Hadoop Distributions
Open Source
Apache
Commercial
Cloudera
Hortonworks
MapR
AWS MapReduce
Microsoft Azure HDInsight (Beta)
58. 1.58
Setting up Hadoop DevelopmentSetting up Hadoop Development
Hadoop
Binaries
Local install
• Linux
• Windows
Cloudera’s Demo
VM
• Need Virtualization
software, i.e. VMware,
etc…
Cloud
• AWS
• Microsoft (Beta)
• Others
Data Storage
Local
• File System
• HDFS Pseudo-distributed (single-node)
Cloud
• AWS
• Azure
• Others
MapReduce
Local
Cloud
Other
Libraries &
Tools
Vendor Tools
Libraries
Comparing: RDBMS vs. Hadoop
Traditional RDBMS vs. Hadoop / MapReduce
Data Size: Gigabytes (Terabytes) vs. Petabytes (Exabytes)
Access: Interactive and Batch vs. Batch – NOT Interactive
Updates: Read / Write many times vs. Write once, Read many times
Structure: Static Schema vs. Dynamic Schema
Integrity: High (ACID) vs. Low
Scaling: Nonlinear vs. Linear
Query Response Time: Can be near immediate vs. Has latency (due to batch processing)
MapReduce
Typical big data problem
Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Programmers specify two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
All values with the same key are sent to the same reducer
The execution framework handles everything else…
Map
Reduce
Key idea: provide a functional abstraction
for these two operations
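The functional abstraction can be illustrated with a toy, single-process simulation in Python (a sketch for intuition only; `run_mapreduce` is a hypothetical name, and a real Hadoop job distributes these phases across a cluster):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: the programmer supplies only map_fn and reduce_fn;
    the 'framework' groups all values with the same key (the shuffle)
    and hands each group to a single reduce call."""
    intermediate = defaultdict(list)
    for k1, v1 in records:                      # map phase
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)         # shuffle: same key -> same reducer
    return {k2: reduce_fn(k2, values)           # reduce phase
            for k2, values in sorted(intermediate.items())}

# Example: word count over one record
counts = run_mapreduce(
    [("Q1", "the store opens the store")],
    map_fn=lambda k, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda k, values: sum(values),
)
# counts == {"opens": 1, "store": 2, "the": 2}
```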
Philosophy to Scale for Big Data Processing
Divide Work
Combine
Results
Understanding MapReduce
Map>>
(K1, V1)
Info in
Input Split
list (K2, V2)
Key / Value out
(intermediate values)
One list per local node
Can implement local Reducer (or Combiner)
Reduce
(K2, list(V2))
Shuffle / Sort phase precedes Reduce phase
Combines Map output into a list
list (K3, V3)
Usually aggregates intermediate values
(input) <k1, v1> map <k2, v2> combine <k2, list(V2)> reduce <k3, v3> (output)
Shuffle/Sort>>
WordCount - Mapper
Reads in input pair <k1,v1>
Outputs a pair <k2, v2>
Let’s count the number of each word in user queries (or Tweets/Blogs)
The input to the mapper will be <queryID, QueryText>:
<Q1,“The teacher went to the store. The store was closed; the
store opens in the morning. The store opens at 9am.” >
The output would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store,1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store,1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
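A sketch of such a mapper in Python, written in Hadoop Streaming style (reading records from stdin and emitting tab-separated key/value pairs; the whitespace tokenization is simplified and ignores punctuation):

```python
import sys

def map_words(text):
    """Emit (word, 1) for every whitespace-separated token.
    Simplified: no punctuation stripping or case folding."""
    for word in text.strip().split():
        yield word, 1

if __name__ == "__main__":
    # Hadoop Streaming convention: one input record per stdin line,
    # one "key<TAB>value" pair per stdout line.
    for line in sys.stdin:
        for word, count in map_words(line):
            print(f"{word}\t{count}")
```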
WordCount - Reducer
Accepts the Mapper output (k2, v2), and aggregates values on the key to generate (k3, v3)
For our example, the reducer input would be:
<The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
The output would be:
<The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1>
<closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
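A matching Hadoop Streaming style reducer sketch in Python; it relies on the shuffle/sort phase having already grouped identical keys onto adjacent input lines (again an illustration, not Hadoop's Java API):

```python
import sys
from itertools import groupby

def reduce_counts(pairs):
    """Sum the counts for each run of identical keys.
    Assumes pairs arrive sorted by key, as the shuffle/sort guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Parse "word<TAB>count" lines produced by the mapper.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, total in reduce_counts((w, int(c)) for w, c in pairs):
        print(f"{word}\t{total}")
```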
MapReduce Example - WordCount
Hadoop MapReduce is an implementation of MapReduce
MapReduce is a computing paradigm (Google)
Hadoop MapReduce is open-source software
AWS (Amazon Web Services)
Amazon
From Wikipedia 2006
From Wikipedia 2016
AWS (Amazon Web Services)
AWS is a subsidiary of Amazon.com, which offers a suite of cloud
computing services that make up an on-demand computing platform.
Amazon Web Services (AWS) provides a number of different services,
including:
Amazon Elastic Compute Cloud (EC2)
Virtual machines for running custom software
Amazon Simple Storage Service (S3)
Simple key-value store, accessible as a web service
Amazon Elastic MapReduce (EMR)
Scalable MapReduce computation
Amazon DynamoDB
Distributed NoSQL database, one of several in AWS
Amazon SimpleDB
Simple NoSQL database
...
Cloud Computing Services in AWS
IaaS
EC2, S3, …
Highlight: EC2 and S3 are two of the earliest products in AWS
PaaS
Aurora, Redshift, …
Highlight: Aurora and Redshift are two of the fastest growing
products in AWS
SaaS
WorkDocs, WorkMail
Highlight: May not be the main focus of AWS
Setting up an AWS account
Sign up for an account on aws.amazon.com
You need to choose an username and a password
These are for the management interface only
Your programs will use other credentials (RSA keypairs, access
keys, ...) to interact with AWS
aws.amazon.com
Signing up for AWS Educate
Complete the web form on
https://aws.amazon.com/education/awseducate/
Assumes you already have an AWS account
Use your Penn email address!
Amazon says it should only take 2-5 minutes (but don’t rely on
this!!)
This should give you $100/year in AWS credits. Be careful!!!
Big Data Applications
Link analysis
Graph data processing
Data stream mining
Large-scale machine learning
NoSQL
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the
concept of joins
All NoSQL offerings relax one or more of the ACID properties (will talk
about the CAP theorem)
Why NoSQL?
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, need to have other
data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now than even a year
ago
Think about proposing a Ruby/Rails or Groovy/Grails solution now
versus a couple of years ago
What kinds of NoSQL
NoSQL solutions fall into two major areas:
Key/Value or ‘the big hash table’.
Amazon S3 (Dynamo)
Voldemort
Scalaris
Memcached (in-memory key/value store)
Redis
Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based.
Cassandra (column-based)
CouchDB (document-based)
MongoDB (document-based)
Neo4J (graph-based)
HBase (column-based)
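The key/value model above can be pictured as a big, distributed hash table. A minimal in-memory sketch in Python (illustrative only; real stores like Memcached or Redis add networking, eviction, and replication):

```python
class KeyValueStore:
    """Tiny in-memory stand-in for the 'big hash table' model:
    opaque values addressed only by key, with no schema or joins."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
store.get("user:42")        # {"name": "Ada"}
store.get("user:99", {})    # {} -- missing keys return the default
```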
Common Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
Down nodes easily replaced
No single point of failure
Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
What am I giving up?
joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful
query language
easy integration with other applications that support
SQL
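To make the trade-off concrete: what SQL expresses as GROUP BY and ORDER BY must be reimplemented in application code over a NoSQL store. A sketch in Python over hypothetical order records:

```python
from collections import defaultdict

def count_by(records, field):
    """Client-side replacement for
    SELECT field, COUNT(*) FROM ... GROUP BY field ORDER BY COUNT(*) DESC."""
    counts = defaultdict(int)
    for record in records:
        counts[record[field]] += 1                        # the GROUP BY
    return sorted(counts.items(), key=lambda kv: -kv[1])  # the ORDER BY

orders = [{"city": "Sydney"}, {"city": "Perth"}, {"city": "Sydney"}]
count_by(orders, "city")   # [("Sydney", 2), ("Perth", 1)]
```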