Big Data Analysis using Hadoop on a Eucalyptus Cloud
How secure is our cloud?
PRESENTED BY: ABHISHEK DE
STUDENT, CSE 2ND YEAR, BPPIMT
Contents:
• The Big Data Crisis
• Let’s embrace Cloud Computing
• Benefits of cloud
• Establishing an IaaS using Eucalyptus
• A word on Virtualization
• Hadoop as a Platform
• MapReduce and HDFS
• Typical algorithms
• Benefits we achieve
• How secure is the system?
PREPARED BY: ABHISHEK DE
2
06-Apr-13
The drifting era: BIG DATA and crisis
• YouTube users upload 48 hours of new video every minute of the day.
• 100 terabytes of data are uploaded to Facebook daily.
• Twitter sees roughly 175 million tweets every day and has more than 465 million accounts.
• Walmart handles more than 1 million customer transactions every hour, and its databases hold more than 2.5 petabytes of data.
DATA is precious, too precious..
We need Infrastructure, which comes easily as a Service
Solution: Cloud Computing
• Conventional computing: Your data gets processed on your own computer.
• Cloud computing: You send your data to another computer, where it is processed, and the results come back to you.
“Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet).”
-- Wikipedia
Benefits of Cloud Computing:
• High reliability.
• Highly scalable and fault tolerant.
• Reduced cost: pay only for what you need.
• Efficient management of resources.
• Improved security.
• Achieved out of commodity hardware.
Why Eucalyptus?
“Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems”
Eucalyptus is the world's most widely deployed software platform for on-premise
(private) Infrastructure as a Service (IaaS) clouds.
It uses existing infrastructure to create a scalable, secure web services layer that
abstracts compute, network and storage to offer IaaS.
Eucalyptus can be dynamically scaled up or down depending on application
workloads.
Architecture of Eucalyptus:
FRONT END:
• Users log in to the cloud using their credentials.
• The user is redirected to the back end of the cloud, i.e., the storage and the resource pool.
BACK END:
• Runs the Node Controller.
• Mounts images as virtual machines (instances) using Xen or KVM.
• Hosts the resource pool.
[Diagram: FRONT END and BACK END, showing the prompts "user1" and "user1@nc1:"]
XEN: Virtualize your resources
• Xen is the underlying virtualization technology used by Eucalyptus. The Xen hypervisor allows several guest operating systems to run concurrently on the same computer hardware.
• Xen partitions a single physical machine into multiple virtual machines to provide server consolidation and utility computing. Existing applications and binaries run unmodified.
• The hypervisor controls the MMU, CPU scheduling, and the interrupt controller, presenting a virtual machine to guests.
HADOOP: Solution to BIG DATA
• Roughly how long does it take to read 1 TB from a commodity hard disk?
• Roughly 4 hours.
• With Hadoop it takes around:
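The 4-hour figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a sustained read rate of 70 MB/s per disk and, for the parallel case, 100 disks each reading a disjoint part of the data with no coordination overhead. Both numbers are illustrative assumptions, not values from the slides, and the parallel result is an idealised bound, not the figure missing from the slide.

```python
TB = 10**12  # bytes in one terabyte (decimal)
MB = 10**6   # bytes in one megabyte (decimal)


def read_time_seconds(size_bytes, bytes_per_second, disks=1):
    """Time to scan size_bytes when `disks` drives read disjoint parts in parallel."""
    return size_bytes / (bytes_per_second * disks)


# Assumed throughput: 70 MB/s sustained sequential read per disk.
single_disk_hours = read_time_seconds(TB, 70 * MB) / 3600

# Assumed cluster: 100 disks, perfect parallelism, no network or scheduling cost.
hundred_disks_minutes = read_time_seconds(TB, 70 * MB, disks=100) / 60

print(f"1 disk:    {single_disk_hours:.2f} hours")
print(f"100 disks: {hundred_disks_minutes:.2f} minutes")
```

The point of the comparison is that scan time falls roughly linearly with the number of disks reading in parallel, which is the property a distributed file system is built to exploit.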
Birth of HADOOP: Open-source alternative to GFS
• Pre-2004: Cutting and Cafarella develop open-source projects for web-scale indexing, crawling, and search.
• 2004: Jeffrey Dean and Sanjay Ghemawat introduce the MapReduce model used internally at Google.
• 2006: Hadoop becomes an official Apache project; Cutting joins Yahoo!, and Yahoo! adopts Hadoop.
HDFS: Hadoop Distributed File System
• Files are split into 128 MB (or 64 MB) blocks.
• Blocks are replicated across several datanodes (usually 3).
• A single namenode stores metadata (file names, block locations, etc.).
• Optimized for large files and sequential reads.
• Clients read from the closest available replica (note: locality of reference).
• If the replication for a block drops below the target, it is automatically re-replicated.
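The block and replica bookkeeping described above can be sketched in a few lines. This is a toy model under stated assumptions: the round-robin placement and the function names are hypothetical, chosen only for illustration, and real HDFS uses a rack-aware placement policy instead.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide
REPLICATION = 3                 # usual replication factor quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length of each block a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Hypothetical round-robin placement: block b goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def under_replicated(placement, live_nodes, replication=REPLICATION):
    """Blocks whose count of live replicas has dropped below the target."""
    return [
        b for b, nodes in placement.items()
        if sum(node in live_nodes for node in nodes) < replication
    ]


# A 300 MB file becomes two full blocks and one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

# If dn4 fails, every block that had a replica there must be re-replicated.
needs_copy = under_replicated(placement, live_nodes={"dn1", "dn2", "dn3"})
```

The mapping held in `placement` plays the role of the namenode metadata: it records where each block lives, and comparing it against the set of live datanodes is what triggers re-replication.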
[Diagram: a Namenode and several Datanodes holding replicas of blocks 1, 2, 3 and 4]
Data Flow
[Diagram components: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, MySQL]
HADOOP and MapReduce:
Input → Map → Shuffle/Sort → Reduce → Output
Word Count: A typical Example
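Word count is the usual first MapReduce program. The sketch below runs the map, shuffle/sort and reduce phases in a single Python process to show the data flow only; the function names are illustrative and this is not the Hadoop API, where the shuffle is performed by the framework across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in order and return {word: total}."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


result = word_count(["the quick fox", "the lazy dog the end"])
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread over many machines without changing the logic.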
Implementation: Hardware
• Move code to data (local computation).
• Allow programs to scale transparently with respect to the size of the input.
• Abstract away fault tolerance, synchronization, etc.
HADOOP in action!
• SOCIAL NETWORKING ANALYSIS
• PAGE RANKING ANALYSIS
• ANALYTICS ENGINE WITH MAP/REDUCE
• IMAGE PROCESSING
Social Networking Analysis:
• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task:
– U (the target user) is fixed and its friends list is copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph
– In: (X, <friendsX>), i.e. the local data for the cluster node
– Out: if U and X are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U; nil otherwise
• Reduce task:
– In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, … >), i.e. the FOAF lists for all users A, B, etc. who are friends with U
– Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
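The map and reduce tasks above can be sketched as two plain Python functions. This is an in-memory illustration, not Hadoop code: the function names and the sample graph are hypothetical, and removing U itself from the candidate set is an added assumption, since U is in the friends list of each of its friends.

```python
from collections import Counter


def foaf_map(target, target_friends, x, friends_of_x):
    """Map: for a friend X of the target, emit X's friends that the target lacks."""
    if x not in target_friends:
        return None  # "nil otherwise"
    # Set difference friendsX \ friendsU; the target is also dropped (assumption).
    candidates = set(friends_of_x) - set(target_friends) - {target}
    return target, candidates


def foaf_reduce(target, candidate_sets):
    """Reduce: count how often each candidate occurs, then rank by count."""
    counts = Counter()
    for candidates in candidate_sets:
        counts.update(candidates)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return target, ranked


def recommend(graph, target):
    """Drive both phases over an in-memory graph {user: set_of_friends}."""
    target_friends = graph[target]
    emitted = (foaf_map(target, target_friends, x, fx) for x, fx in graph.items())
    candidate_sets = [pair[1] for pair in emitted if pair is not None]
    return foaf_reduce(target, candidate_sets)


# Hypothetical sample graph: C is reachable through both A and B, D only through A.
graph = {
    "U": {"A", "B"},
    "A": {"U", "C", "D"},
    "B": {"U", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
recommendation = recommend(graph, "U")
```

Candidates with more mutual friends rank first, which matches the "sort/rank the result" step of the reduce task.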
Pros and Cons
• Batch, offline jobs
• Write-once, read-many across the full data set
• Usually, though not always, simple computations
• I/O bound by disk/network bandwidth
What it’s not:
• High-performance parallel computing, e.g. MPI
• A low-latency, random-access relational database
• Always the right solution
Cloud Security: Threats unveiled
XML SIGNATURE ATTACK:
• The original SOAP body element is moved into a newly added bogus wrapper element in the SOAP security header. The moved body is still referenced by the signature through its identifier attribute Id="body", and the signature remains cryptographically valid because the body element has not been modified, only relocated. To keep the SOAP message XML schema compliant, the attacker then gives the SOAP body in the regular position a different identifier (in this example, Id="attack"). That empty SOAP body can now be filled with bogus content, and any operation defined by the attacker is executed because signature verification succeeds.
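The mismatch that makes this attack work can be shown with a toy message. The sketch below is a simplified illustration: namespaces, the actual cryptographic check and the real WS-Security layout are omitted, and the element and operation names are placeholders. It only shows that the element found by following the signature reference need not be the element the service goes on to execute.

```python
import xml.etree.ElementTree as ET

# Simplified wrapped message (assumed layout, namespaces omitted for brevity).
WRAPPED = """<Envelope>
  <Header>
    <Security>
      <Signature><Reference URI="#body"/></Signature>
      <Wrapper>
        <Body Id="body"><Operation>DescribeInstances</Operation></Body>
      </Wrapper>
    </Security>
  </Header>
  <Body Id="attack"><Operation>TerminateInstances</Operation></Body>
</Envelope>"""


def signed_element(root):
    """Element a naive verifier checks: whatever carries the referenced Id."""
    wanted = root.find(".//Reference").get("URI").lstrip("#")
    for element in root.iter():
        if element.get("Id") == wanted:
            return element
    return None


def executed_element(root):
    """Element the application logic processes: the Body directly under Envelope."""
    return root.find("Body")


def is_wrapped(root):
    """True when the verified element and the executed element differ."""
    return signed_element(root) is not executed_element(root)


root = ET.fromstring(WRAPPED)
verified_op = signed_element(root).find("Operation").text
executed_op = executed_element(root).find("Operation").text
```

A check along the lines of `is_wrapped` reflects the usual countermeasure: the service should only act on the exact element that the signature verification covered.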
Script Injection Attack
• Targets only users of the AWS management console.
• Exploits the credentials shared between the Amazon shop interface and AWS.
• The first vulnerability exploits the GET parameters in the download link that users follow to download their X.509 certificates issued by Amazon. The preconditions for this attack are rather high, however: the injected script must be UTF-7 encoded to bypass the server logic that encodes standard HTML characters, and the attack relies on features of specific IE versions.
• The second script injection attack is a persistent cross-site scripting attack that exploits the login session initiated with AWS the first time a user logs into the Amazon shop interface.
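Why UTF-7 helps an attacker get past HTML-character encoding can be seen in a small, harmless example. The sketch below uses Python's standard `html.escape` as a stand-in for server-side escaping; that stand-in and the sample payload are assumptions for illustration and say nothing about the actual server logic involved.

```python
import html

# Illustrative payload: "<" and ">" written in UTF-7 as "+ADw-" and "+AD4-".
payload_utf7 = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# An escaper that looks for literal <, >, &, and quotes finds nothing to change.
escaped = html.escape(payload_utf7)

# A client that interprets the bytes as UTF-7 recovers the script tag.
decoded = payload_utf7.encode("ascii").decode("utf-7")

# Escaping after decoding (or pinning the response charset) neutralises it.
safe = html.escape(decoded)
```

The filter is only effective when server and browser agree on the character encoding, which is why declaring the charset explicitly in the response is the usual defence.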
Who uses it? Applications and Innovations
Projects under Hadoop:
• HBase
• ZooKeeper
• Pig
• Zombie
• Hive
• Sqoop
References:
• http://www.eucalyptus.com/what-is-cloud-computing
• http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
• http://int3.de/res/GfsMapReduce/GfsAndMapReduce.pdf
• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html
• http://www.change-project.eu/fileadmin/publications/Presentations/CHANGE_-_The_role_of_virtualisation_in_future_network_infrastructures_-_Warsaw_cluster_workshop_contribution.pdf
• http://wiki.apache.org/hadoop/NameNode
That’s the end..
But the beginning of a new horizon..
Special thanks to the entire team that helped me in this endeavor.
ALL QUERIES, PLEASE CONTACT ME AT: abhishekde@hotmail.com
QUESTIONS?

Big Data Analysis on a Cloud Ecosystem-PATW 2013
