Research Issues in P2P Computing

How to Run Applications Faster?
• There are 3 ways to improve performance:
   – Work Harder
   – Work Smarter
   – Get Help
• Computer Analogy
   – Faster hardware: high-performance processors or peripheral devices
   – Optimized algorithms and techniques used to solve computational tasks
   – Multiple computers to solve a particular task




OUTLINE
• Centralized vs. Distributed
• What is P2P?
• P2P Architectures
• P2P and Applications
• Search and Replication Techniques
• P2P Security
• Emerging P2P Applications
• Conclusion

Distributed…
• When a handful of powerful computers are linked together and communicate with each
  other
   – The overall computing power available can be amazingly vast.
   – Such a system can have a higher performance share than a single supercomputer.
   – The objective of such systems is to minimize communication and computation cost.




Centralized?
• Computation in networks of processing nodes can be classified into centralized or
  distributed computations.
• A centralized solution relies on one node being designated as the computer node
  that processes the entire application locally.
• The central system is shared by all the users all the time.
• There is a single point of control and a single point of failure.

• A distributed system is an application that executes a collection of protocols to
  coordinate the actions of multiple processes on a communication network, such that
  all components cooperate to perform a single task or a small set of related tasks.




• The collaborating computers can access remote resources as well as local resources
  in the distributed system via the communication network.
• The existence of multiple autonomous computers is transparent to the user in a
  distributed system.
  – The user is not aware that the jobs are executed by multiple computers located
    in remote places.
  – A centralized algorithm is at the heart of a single computer.
  – A distributed algorithm is at the heart of a society of computers.

Examples of Distributed Systems
• The Internet
   – Heterogeneous network of computers and applications
   – Implemented through the Internet Protocol Stack




Computer Networks vs. Distributed Systems
• Computer Network: the autonomous computers are explicitly visible
• Distributed System: existence of multiple autonomous computers is transparent
• Many problems in common
• Normally, every distributed system relies on services provided by a computer network.

Distributed…
• Distributed systems are built on top of existing networking and operating systems
  software.
• The middleware enables computers to coordinate their activities and to share the
  resources of the system
   – Middleware is the bridge that connects distributed applications across dissimilar
     physical locations, with dissimilar hardware platforms, network technologies,
     operating systems, and programming languages.
• Middleware provides standard services such as naming, concurrency control, event
  distribution, authorization to specify access rights to resources, security, etc.




Computing Platforms Evolution: Breaking Administrative Barriers
[Figure: performance plotted against computing platform: Desktop (Single Processor),
SMPs or SuperComputers, Local Cluster, Enterprise Cluster/Grid, Global Cluster/Grid,
Inter Planet Cluster/Grid (??). Administrative barriers: Individual, Group, Department,
Campus, State, National, Globe, Inter Planet, Universe.]

Foster-Kesselman
• The Foster-Kesselman duo organized in 1997, at Argonne National Laboratory, a
  workshop entitled “Building a Computational Grid”.
• At this moment the term “Grid” was born.
• The workshop was followed in 1998 by the publication of the book “The Grid:
  Blueprint for a New Computing Infrastructure” by Foster and Kesselman themselves.
• For these reasons they are not only to be considered the fathers of the Grid, but
  their book, which in the meantime was almost entirely rewritten and re-published
  in 2003, is also considered the “Grid bible”.

Ian Foster, Mathematics and Computer Science Division, Argonne National Laboratory,
Argonne, IL 60439
Carl Kesselman, Information Sciences Institute, University of Southern California,
Marina del Rey, CA 90292




The Need for Collaboration?
• Worldwide business demands intense problem-solving capabilities for incredibly
  complex problems
   – the need for dynamic collaboration of many computing resources to be able to
     work together.
• Achieving this level of resource collaboration within the bounds of the necessary
  quality requirements of the end user is a difficult challenge across all the
  technical communities.

Electric Grid and Grid Computing
• Computing grids are conceptually not unlike electrical grids.
• Electric power grid - a variety of resources contribute power into a shared "pool"
  for many consumers to access on an as-needed basis.
   – In an electrical grid, wall outlets allow us to link to an infrastructure of
     resources that generate, distribute, and bill for electricity.
   – When you connect to the electrical grid, you don't need to know where the power
     plant is or how the current gets to you.
• Grid computing uses middleware to coordinate disparate IT resources across a
  network, allowing them to function as a virtual whole.
   – The goal of a computing grid, like that of the electrical grid, is to provide
     users with access to the resources they need, when they need them.




Why Grids?
Large Scale Exploration needs them
Solving technology problems using computer modeling, simulation and analysis
[Figure: application areas: Life Sciences, Aerospace, Geographic Information Systems,
CAD/CAM, Military Applications.]




CERN’s Large Hadron Collider
1800 Physicists, 150 Institutes, 32 Countries
100 PB of data by 2010; 50,000 CPUs?

The Large Hadron Collider (LHC)
   A gigantic scientific instrument near Geneva
   It is a particle accelerator used by physicists to study the smallest known
   particles – the fundamental building blocks of all things.

Client-Server Model
The most widely used
[Figure: clients and servers exchanging invocation and result messages.
Key: Process, Computer.]

Why P2P?

Client-Server
[Figure legend: Source, Router, “Interested” End-host.]




Client-Server
Overloaded!
[Figure legend: Source, Router, “Interested” End-host.]

Why P2P?
[Figure: Internet-connected machines with idle capacity: personal computers, 80% idle
CPU time; laptop, 90% idle CPU time; computers in our lab, 99% idle CPU time.
Label: Hot spots become hotter.]




Problem with Client-Server Model
   – Scalability
      • As the number of users increases, there is a higher demand for computing
        power, storage space, and bandwidth associated with the server-side
   – Reliability
      • The whole network will depend on the highly loaded server to function properly

What is driving P2P?
• Clients are not so dumb.
• Billions of MHz CPU, tons of terabytes disk, millions of gigabits network
  bandwidth, …
   – Unused resources.




Computer System Taxonomy
[Figure: Computer Systems are divided into Centralized Systems (mainframes, SMPs) and
Distributed Systems; Distributed Systems are divided into Client-server and
Peer-to-Peer.]

P2P – An overlay network
• P2P overlay network
   – The connected nodes construct a virtual overlay network on top of the underlying
     network infrastructure
   – Peer-to-peer network topology is a virtual overlay at application layer
[Figure: peers A to G connected by overlay links on top of the underlying network.]
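The overlay idea can be illustrated with a minimal sketch. The topology below is hypothetical (three peers A, B, C attached to routers R1 to R3; none of these names come from the slides), and the helper functions are illustrative only. It shows that a single overlay hop between two peers is carried by a multi-hop path in the underlying network.

```python
from collections import deque

# Hypothetical physical (underlay) topology: hosts A, B, C attach to routers R1..R3.
UNDERLAY = {
    "A": ["R1"], "B": ["R2"], "C": ["R3"],
    "R1": ["A", "R2"], "R2": ["R1", "B", "R3"], "R3": ["R2", "C"],
}

# Hypothetical overlay: application-level neighbor lists among the peers only.
OVERLAY = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the node list from src to dst, or None."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def underlay_route(overlay_path):
    """Expand each overlay hop into the physical path that carries it."""
    route = [overlay_path[0]]
    for a, b in zip(overlay_path, overlay_path[1:]):
        route.extend(shortest_path(UNDERLAY, a, b)[1:])
    return route


overlay_path = shortest_path(OVERLAY, "A", "B")   # route chosen at application layer
physical_route = underlay_route(overlay_path)     # what the network actually carries
```

In this toy topology the overlay route from A to B passes through peer C, so the physical route is longer than the direct underlay path between A and B. That mismatch is one reason overlay construction matters.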




[Figure: two panels, "Client/server model" and "Peer-to-peer model". Labels: Internet,
Client, Cache, Proxy, server, Congestion zone, Client/Server.]

Typical Characteristics
• Large Scale: lots of nodes (up to millions)
• Dynamicity: frequent joins, leaves, failures
• Little or no infrastructure
   – No central server
• Symmetry: all nodes are “peers” – have same role




What is it...
• P2P computing is the sharing of computer resources and services by direct exchange
  between systems.
• These resources and services include the exchange of information, processing
  cycles, cache storage, and disk storage for files.
• P2P computing takes advantage of existing desktop computing power and networking
  connectivity,
   – allowing economical clients to leverage their collective power to benefit the
     entire enterprise.
• In a P2P architecture, computers that have traditionally been used solely as clients
  communicate directly among themselves and can act as both clients and servers,
  assuming whatever role is most efficient for the network.
• Each node (peer), called a servent, acts as both a SERVer and a cliENT

P2P Dominates Internet Traffic
• P2P has dominated Internet traffic
   – In 2006, more than 60% of Internet traffic
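The servent role, in which one node is both SERVer and cliENT, can be sketched in a few lines. This is an in-process illustration only: the class name, method names, and file names are assumptions made for the example, and direct method calls stand in for network messages.

```python
class Servent:
    """A peer that plays both roles: it serves its own files and fetches from others."""

    def __init__(self, name, shared_files=None):
        self.name = name
        self.shared = dict(shared_files or {})   # shared folder: file name -> content
        self.neighbors = []                      # directly connected peers

    def connect(self, other):
        """Create a symmetric link between two peers."""
        if other not in self.neighbors:
            self.neighbors.append(other)
            other.neighbors.append(self)

    # Server role: answer a request coming from another peer.
    def handle_request(self, filename):
        return self.shared.get(filename)

    # Client role: ask the neighbors for a file.
    def fetch(self, filename):
        for peer in self.neighbors:
            content = peer.handle_request(filename)
            if content is not None:
                self.shared[filename] = content  # this peer can now serve it too
                return content
        return None


alice = Servent("alice", {"song.mp3": b"audio-bytes"})
bob = Servent("bob")
carol = Servent("carol")
bob.connect(alice)
carol.connect(bob)

bob_copy = bob.fetch("song.mp3")      # bob acts as a client of alice
carol_copy = carol.fetch("song.mp3")  # bob now acts as a server for carol
```

Carol is not connected to Alice, yet she obtains the file because Bob, after downloading it as a client, serves it as a server.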




[Figure: peers in a file-sharing network. Each peer has a shared folder and neighbors
and acts as client and server; a peer searches the network and retrieves the file
from another peer.]

Some Statistics about P2P Systems
• More than 200 million users registered with Skype, around 10 million online
  users. (2007)
• Around 4.7M hosts participate in SETI@Home (2006)
• BT (BitTorrent) accounts for 1/3 of Internet traffic (2007)
• More than 200,000 simultaneous online users on PPLive (streaming video
  network). (2007)
• More than 3,000,000 users downloaded PPStream. (2008)




P2P Applications
• In Peer-to-Peer (P2P) computing, applications are segregated into three main
  categories:
   – distributed computing,
   – file sharing, and
   – collaborative applications
• The three categories of P2P serve different purposes
   – Distributed computing applications typically require the decomposition of a
     larger problem into smaller parallel problems
   – File sharing applications require efficient search across wide area networks
   – Collaborative applications require update mechanisms to provide consistency in
     a multi-user environment
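As an illustration of decomposing a larger problem into smaller parallel problems, the sketch below splits a prime-counting job into independent work units. The task, the function names, and the use of local threads as stand-ins for remote peers are all assumptions made for the example; a real distributed computing system would ship the units to other machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, parts):
    """Cut [start, stop) into `parts` contiguous work units of near-equal size."""
    step, extra = divmod(stop - start, parts)
    units, lo = [], start
    for i in range(parts):
        hi = lo + step + (1 if i < extra else 0)
        units.append((lo, hi))
        lo = hi
    return units


def is_prime(n):
    """Trial division; adequate for a small illustrative workload."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(unit):
    """The small problem: count primes in one work unit [lo, hi)."""
    lo, hi = unit
    return sum(1 for n in range(lo, hi) if is_prime(n))


def distributed_count(start, stop, workers=4):
    """The large problem: split it, solve the units independently, merge the results."""
    units = split_range(start, stop, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_primes, units))
    return sum(partial_counts)


total = distributed_count(0, 100)
```

The units share no state, so each one could be computed by a different peer and the partial counts merged in any order.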




P2P Network Architectures
• Centralized (Napster)
• Decentralized
  – Unstructured (Gnutella)
  – Structured (Chord)
• Hierarchical (MBone)
• Hybrid (EDonkey)

P2P Computing
• File sharing (e.g., Gnutella, Freenet, Limewire, KaZaA)
• Collaboration (e.g., Magi, Groove, Jabber)
• Distributed computing (e.g., SETI@home, Search for Extraterrestrial Intelligence)
[Figure: P2P application areas. Communication and collaboration: Groove, Skype.
File sharing: Napster, Gnutella, Kazaa, Freenet, Overnet. Distributed computing:
SETI@Home, folding@Home.]




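[Figure: taxonomy of Computer Systems. Top level: Centralized Systems (mainframes,
SMPs, workstations) and Distributed Systems. Under Distributed Systems: Client-server
and Peer-to-Peer. Under Peer-to-Peer: Centralized and Decentralized. Under
Decentralized: Structured and Unstructured.]

P2P FILE SHARING APPLICATIONS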




P2P Applications
• File sharing (music, movies, …)
     – Utilise the idle disk space for storage and the existing network bandwidth
       for search and download.
     – The cost of operation is very low
          • The majority of peers collect only objects that they are interested in
            anyway.
     – E.g.: Napster, KaZaA and Gnutella

Napster: Example
[Figure: a central index lists which machine holds which file (m1: A, m2: B, m3: C,
m4: D, m5: E, m6: F). A peer asks the index "E?", receives "m5", then asks m5 for E
and receives the file.]




File Sharing Services

• Publish – insert a new file into the network
• Lookup – given a file name X, find the host
  that stores the file
• Retrieval – get a copy of the file
• Join – join the network
• Leave – leave the network
                – Neighbors

Unstructured P2P

(Figure: left, a search flooded to connected peers, followed by a direct transfer; right, 1. query from a peer node to its supernode, 2. query flooded between supernodes.)



Centralized P2P
•   Utilize a central directory for object
    location
•   For file-sharing P2P, the location is queried
    from central servers, then the file is downloaded
    directly from peers
•   Benefits
      – Simplicity
      – Limited bandwidth usage
•   Drawbacks
      – Unreliable (single point of failure),
         performance bottleneck, and
         scalability limits
      – Vulnerable to DoS attacks
      – Copyright infringement

(Figure: peers upload indexes to the Centralized Server; 1. query, 2. response, 3. transfer.)

File Sharing: Gnutella
• Gnutella is a file sharing protocol
• Gnutella was originally designed by Nullsoft, a
  subsidiary of America Online.
• Its architecture is completely decentralised and
  distributed
• When a client wishes to connect to the network,
  it runs through a list of nodes that are most likely
  to be up, or takes a list from a website, and then
  connects to however many nodes it wants




Gnutella Search Mechanism

• Assume: m1's neighbors are m2 and m3; m3's neighbors
  are m4 and m5; … A, B, C, D, E, F are resources
• TTL

(Figure: the query "E?" is flooded from peer to peer across m1 to m6, which hold resources A to F.)

Peer-to-Peer File Sharing is all about the trading of
copyrighted music and videos without paying anything to the
authors

(Figure: the KaZaA native Windows application, showing a query by music category and a banner ad; 3 million users online sharing 4 PetaBytes of data.)




• Advantages
  – Fast lookup
  – Low join and leave overhead
  – Popular files are replicated many times, so lookup with small TTL
    will usually find the file
     • Can choose to retrieve from a number of sources                                                 Searching
• Disadvantages
  – Not 100% success rate, since TTL is limited
  – Very high communication overhead
  – Uneven load distribution
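
The trade-off above (limited TTL versus message overhead) can be seen in a small simulation. This is only a sketch, not the Gnutella wire protocol: the overlay links beyond "m1's neighbors are m2 and m3; m3's neighbors are m4 and m5" are assumed, the resource placement (m1 holds A, …, m6 holds F) is assumed, and `flood_search` is a hypothetical helper name.

```python
from collections import deque

# Assumed overlay: only the m1 and m3 neighbor lists come from the slide.
NEIGHBORS = {
    "m1": ["m2", "m3"],
    "m2": ["m1"],
    "m3": ["m1", "m4", "m5"],
    "m4": ["m3"],
    "m5": ["m3", "m6"],
    "m6": ["m5"],
}
# Assumed placement of resources A..F on peers m1..m6.
RESOURCES = {"m1": {"A"}, "m2": {"B"}, "m3": {"C"},
             "m4": {"D"}, "m5": {"E"}, "m6": {"F"}}


def flood_search(neighbors, resources, start, wanted, ttl):
    """Flood a query from `start`; return (peers holding `wanted`, messages sent)."""
    seen = {start}
    frontier = deque([(start, None, ttl)])  # (node, sender, remaining TTL)
    hits, messages = [], 0
    while frontier:
        node, sender, remaining = frontier.popleft()
        if wanted in resources.get(node, ()):
            hits.append(node)
        if remaining == 0:
            continue  # TTL exhausted: the query is not forwarded further
        for peer in neighbors.get(node, ()):
            if peer == sender:
                continue  # do not echo the query back to where it came from
            messages += 1  # every forwarded copy costs bandwidth
            if peer not in seen:  # duplicates are received but not re-processed
                seen.add(peer)
                frontier.append((peer, node, remaining - 1))
    return hits, messages


print(flood_search(NEIGHBORS, RESOURCES, "m1", "E", ttl=2))
print(flood_search(NEIGHBORS, RESOURCES, "m1", "E", ttl=1))
```

With a TTL of 1 the query never reaches m5, which illustrates why the success rate is below 100%; raising the TTL finds the file but sends more messages.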




                                   Kazaa

      Sharman Networks

   Kazaa is a file sharing program that allows you to download
      audio, video, images, documents and software files.

                       Search in Unstructured P2P

Two general types of search in unstructured P2P:
• Blind: try to propagate the query to a sufficient
  number of nodes (example: Gnutella)
• Informed: utilize information about document
  locations




Blind Search Methods

                     BFS and Random Walk

• BFS
• Random walks
• In unstructured networks, flooding would exhaust the bandwidth of the network.

APS – an example

• Node J holds the requested object
• Nodes deploy 2 walkers, initially
• All index values are 20
• TTL=3




                        Informed search

• Informed: utilize information about document
  locations.
• APS

                        Collaborative Community
• Rapidly changing work environment
   – Out-sourcing, in-sourcing, home-sourcing
   – Tight integration and team work with customers,
     partners, vendors
• P2P allows management of documents at the level
  of closed working groups.
• The collaboration software is designed to
  improve the productivity of individuals with
  common goals or interests.
• Groove is a collaborative P2P system
  (http://www.groove.net)
   – Part of the Microsoft Office system
   – Document sharing and collaboration –
      • vital for a business.
   – Office Groove 2007 is a collaboration software program that
     helps teams work together dynamically and effectively, even
     if team members work for different organizations or work
     remotely.

(Figure: Work Together: Anyone, Anytime, Anyplace. Microsoft Office Groove 2007.)




          Adaptive Probabilistic Search
• Each node keeps a local index
  consisting of one entry for each
  object it has requested per neighbor.
• Index values represent the
  probability of finding that object
  through that neighbor
• Searching is based on the
  simultaneous deployment of k
  walkers and probabilistic forwarding.
• If a hit occurs, the walker terminates
  successfully.
• On a miss, the query is forwarded to
  one of the node's neighbors.

Example (indices at node A)
    A chooses B with Pr = 0.3
    A chooses C with Pr = 0.5
    A chooses D with Pr = 0.2
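
The probabilistic forwarding step can be sketched as follows. This is a minimal illustration, assuming the raw index values at node A are 30, 50 and 20 (chosen here so that they normalise to the probabilities 0.3, 0.5 and 0.2 in the example); the function names are hypothetical, and the rule for updating indices after a hit or a miss is not shown because the slide does not give it.

```python
import random

# Assumed raw index values at node A for one object, keyed by neighbor.
INDEX_AT_A = {"B": 30, "C": 50, "D": 20}


def forward_probabilities(index):
    """Normalise a node's index values into forwarding probabilities."""
    total = sum(index.values())
    return {neighbor: value / total for neighbor, value in index.items()}


def choose_neighbor(index, rng=random):
    """Pick the next hop for one walker, weighted by the index values."""
    neighbors = list(index)
    weights = [index[n] for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


print(forward_probabilities(INDEX_AT_A))
print([choose_neighbor(INDEX_AT_A) for _ in range(5)])  # k = 5 independent walkers
```

When all index values are equal (for example, all initialised to 20), every neighbor is equally likely and the walkers behave like plain random walks.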




Distributed Computing: SETI@home
• Search for Extraterrestrial Intelligence: are we alone
  in the universe, or is there intelligent life
  somewhere else?
• Over two million computers crunch data gathered
  by the Arecibo radio telescope in Puerto Rico, USA
• The SETI@home project is widely regarded as the
  fastest computer in the world
• Sharing of resources such as computation power,
  network bandwidth and storage
• Achieves computing power more cheaply than a
  supercomputer can provide.
• Developed by the Space Sciences Laboratory at the
  University of California, Berkeley, in the United
  States: http://setiathome.ssl.berkeley.edu
• Launched in 1999




            How SETI@home works?
• Collect data source
   – A telescope at Arecibo collects the data from outer space.
   – SETI@home uses a data recorder to record the data on
     removable tape.
• Distribution of data source
   – SETI@home divides the data into fixed-size work units.
   – SETI@home distributes these work units via the Internet from the
     servers to a client program.
   – The client program computes a result, returns it to the server,
     and gets another work unit.
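
The work-unit cycle can be sketched in a few lines. This is a toy model, not the SETI@home client: the unit size, the in-memory queue standing in for the server, and the `analyze` function are all assumptions made for illustration.

```python
from collections import deque


def split_into_work_units(data, unit_size):
    """Cut the recorded data into work units of `unit_size` bytes.

    The final unit may be shorter if the data length is not a multiple
    of the unit size.
    """
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def run_client(server_queue, analyze):
    """Fetch a unit, compute a result, return it, then fetch the next one."""
    results = []
    while server_queue:
        unit = server_queue.popleft()   # "gets another work unit"
        results.append(analyze(unit))   # "returns it to the server"
    return results


# Hypothetical stand-in for signal analysis: the strongest sample in the unit.
recorded = bytes([3, 9, 1, 4, 7, 2, 8, 5])
queue = deque(split_into_work_units(recorded, unit_size=3))
print(run_client(queue, analyze=max))
```

Because each work unit is independent, any number of clients could drain the same queue in parallel, which is what lets the project scale to millions of machines.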




            How SETI@home works? …
• Scientific experiment that uses Internet-connected computers
• Distributes a screen saver–based application to users
• Applies signal analysis algorithms to different data sets to process radio-telescope data.
• Has more than 3 million users

(Figure: radio-telescope data feeds the Main Server; 2. SETI client (screen saver) starts; 3. SETI client gets data from server and runs; 4. Client sends results back to server.)




• “… a free program that uses the latest P2P…technology to
  bring affordable and high quality voice communications to people
  all over the world…”
• Skype offers voice, video, chat and data transfer
  services over IP
• The first stable version of Skype was released in July
  2004; since then the number of users has kept growing.
• Skype now claims more than 20 million
  accounts and between 4 and 6 million users
  simultaneously connected.

Super nodes
• Super nodes (SN) are Skype clients run by users that have a
  “good” Internet connection and a “good” computer.
• Having a good Internet connection means having a public
  IP address, without firewall restrictions.
• A good computer is a machine that can forward other
  users' communications and handle many connections.
• SNs act as relays in the network
   – Hence, they need better connectivity and better performance.
• SNs are used to connect Skype clients (SC) together.




             Skype Software features
• VoIP from computer to computer
   – The most used feature.
• VoIP from computer to regular phone (Skype Out)
   – By registering on Skype's website it is possible to buy credit and then call all over
     the world at very attractive rates compared to those charged by phone
     companies.
• Video conferencing: introduced in Skype 2.0 in 2006.
• Instant Messaging: this feature is comparable to many other
  instant messaging clients like MSN Messenger, Yahoo! Messenger,
  Google Talk, etc.
   – The main difference is that Skype does not tell the user whether the person he
     is chatting with is typing or not. This is due to the P2P design of the Skype
     network.
• File Transfer
   – The Skype network design has a big influence on the quality of file transfers.
     It can make it very fast (1 Mbps) or very slow (3 kbps).

Skype
Skype – login
• Skype clients directly connect to login
  servers, whose IP addresses are hard
  coded within the software.
   – In this connection the login name and
     the version are sent in clear text format.
• The login server stores all user
  names and passwords and ensures
  that names are unique across the
  Skype name space




                Internet Telephony - Skype
• The participants form a self-organizing
  P2P overlay network to locate and
  communicate with other participants.
• The bandwidth is shared, and the sound
  or video is shared in real time as a resource
• Skype has a similar architecture to its
  predecessor KaZaA
• There are three types of nodes in the
  Skype network:
   – Ordinary-peers
   – Super-nodes
   – Central login server
• Communications are encrypted (RSA)

• Connection to a bootstrap node
   – When the SC (Skype Client) is installed for the first time, it
     comes with a list of SNs to connect to.
   – First, the Skype Client tries to connect to 5 SNs by sending
     a UDP packet to the IP addresses of super nodes
     randomly chosen from the host cache.
   – When the client finds a super node to connect to, it
     refreshes its list of active and available super nodes in
     the host cache.
   – The SC connects to an SN
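
The bootstrap steps can be sketched as below. This is a simulation under stated assumptions, not Skype's actual (proprietary) protocol: no UDP packets are sent, and `is_reachable` and `fetch_super_nodes` are hypothetical callbacks standing in for the probe and for the list refresh.

```python
import random


def bootstrap(host_cache, is_reachable, fetch_super_nodes, attempts=5, rng=random):
    """Try up to `attempts` randomly chosen super nodes from the host cache.

    On the first reachable super node, refresh the host cache with the
    super nodes it reports and return its address; return None if none answer.
    """
    candidates = rng.sample(sorted(host_cache), min(attempts, len(host_cache)))
    for super_node in candidates:
        if is_reachable(super_node):                         # stand-in for the UDP probe
            host_cache.update(fetch_super_nodes(super_node))  # refresh the SN list
            return super_node
    return None


# Hypothetical addresses shipped with the client at install time.
cache = {"198.51.100.1:33033", "198.51.100.2:33033", "198.51.100.3:33033"}
chosen = bootstrap(
    cache,
    is_reachable=lambda sn: sn.startswith("198.51.100.2"),
    fetch_super_nodes=lambda sn: {"203.0.113.9:33033"},
)
print(chosen, sorted(cache))
```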




(Figure: Traffic volume by content type (Germany, BitTorrent))

  Skype - user search
  • Similar to KaZaA (searching for the callee)
  • The client sends a user name to its SN and in response
    receives a few IP addresses and port numbers
  • The client then contacts these nodes
  • If it cannot find the user, it sends the request to its SN
    once again and receives another few IP
    addresses and port numbers
  • The process continues until the user is found




  Skype - call establishment
  • Routing in the Skype overlay network is done by
    the SN.
  • When an SC tries to establish a call, it first asks its
    SN (if it is not an SN itself) where the callee is and
    tries to connect directly to it.
       – If the SC is restricted by a firewall, it
         connects to the callee using an SN as a relay.
       – If both the caller and the callee have public IP addresses, the
         caller sends signaling information over TCP to the callee

What is PPLive?
   – An online video broadcasting and advertising
     network
        • Provides an online viewing experience
          comparable to that of traditional TV
          broadcasting
        • 75 million global installed base and 20
          million monthly active users
        • 600+ channels on PPLive with content
          ranging from news, music, sports, movies,
          games, live video and other interactive
          services to a global audience
   – An efficient P2P technique platform and test
     bench
History of PPLive:
• Bill’s story
   – Inventor of PPLive core technology
   – Dropped out of post-graduate program to start
     PPLive




  P2P VIDEO STREAMING                                                                                 PPLIVE
• Streaming video is content sent in compressed form over
  the Internet and displayed by the viewer in real time.
• With streaming video or streaming media, a Web user does
  not have to wait to download a file to play it - the media is
  sent in a continuous stream of data and is played as it
  arrives.
• The user needs a player, which is a special program that
  uncompresses and sends video data to the display and
  audio data to speakers.
• A player can be either an integral part of a browser or
  downloaded from the software maker's Web site.

• P2P streaming
   – P2P TV
       • PPLive, PPStream, Joost (by Skype
         founders), …




                            Industry Trends
PPLive is well positioned to exploit the next explosive growth

(Figure: P2P applications ordered from basic to advanced along a 2001 to 2005 timeline: File Sharing (Napster), Downloading (BitTorrent), VoIP (Skype), Video Streaming (PPLive).)

Streaming Tree Reconstruction after a Peer Departure

(Figure)




                                           PPLive

        Media Server (channel management server) - Retrieve list of channels via HTTP

        Membership Server - Retrieve small list of member nodes of interest via UDP

                                    Multi-tree Streaming

Since all peers are involved in the data distribution, the load is
spread among all nodes.




                        Single-tree Streaming
• A common approach to P2P
  streaming is to organize
  participating peers into a single
  tree-structured overlay
    – The content is pushed from the
      source towards all peers.
    – This way of organizing peers is called
      single-tree streaming.
• In these systems, peers are
  hierarchically organized in a tree
  structure where the root is the
  stream source.
• The content is spread as a
  continuous flow of information
  from the source down the
  tree.

(Figure: a snapshot of a tree-based overlay with 231 nodes)
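
Pushing content down a single tree can be sketched as a breadth-first traversal from the source. This is a toy model: the tree shape and peer names are made up, and `push_chunk` is a hypothetical helper that only records how many overlay hops each peer is from the source.

```python
from collections import deque

# Hypothetical overlay tree: parent -> list of children; "source" is the root.
CHILDREN = {
    "source": ["p1", "p2"],
    "p1": ["p3", "p4"],
    "p2": ["p5"],
}


def push_chunk(children, source):
    """Push one chunk from the source down the tree.

    Returns {peer: hops from the source}; each parent forwards the chunk
    to its children, so delivery delay grows with depth in the tree.
    """
    depth = {source: 0}
    queue = deque([source])
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, ()):
            depth[child] = depth[parent] + 1
            queue.append(child)
    return depth


print(push_chunk(CHILDREN, "source"))
```

In this model the leaves (p3, p4, p5) never forward anything, so all upload work falls on the interior peers.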




                          Bit Torrent

                 • Created by Bram Cohen in 2001

Overall Architecture

(Figure: a Web Server, a Tracker, and three peers: A (leech, the downloader "US"), B (leech) and C (seed).)




                 What is BitTorrent?
• A peer-to-peer file transfer protocol
• Extremely popular today
• “Pull-based” “swarming” approach
• Each file split into smaller pieces
• Nodes request desired pieces from
  neighbors
   – As opposed to parents pushing data
     that they receive
• Pieces not downloaded in sequential
  order
• Encourages contribution by all nodes








                         BitTorrent Lingo

Seeder = a peer that provides the complete file.
Initial seeder = a peer that provides the initial copy.
Leecher = one who is downloading.

(Figure: an initial seeder, a seeder and two leechers.)




                         BitTorrent Basics
• Files are broken into pieces.
  – Users each download different pieces from the
    original uploader (seed).
  – Users exchange the pieces with their peers to obtain
    the ones they are missing.
• This process is organized by a centralized server
  called the Tracker.




                         Critical Elements
• A web server
   – stores and serves the .torrent file.
   – For example:
       • http://bt.btchina.net
       • http://bt.ydy.com/

(Figure: a Web Server hosting The Lord of Ring.torrent and Troy.torrent.)




                  Critical Elements
 • The .torrent file
   – Static 'metainfo' file containing the necessary
     information:
       •   URL of tracker
       •   Piece length – usually 256 KB
       •   SHA-1 hashes of each piece in file
       •   IP address of the Tracker

(Figure: Matrix.torrent)

BitTorrent Swarm
• Swarm
    – Set of peers all downloading the same file
    – Organized as a random mesh
• Each node knows list of pieces downloaded by neighbors
• Node requests pieces it does not own from neighbors
• Swarm
    – The group of machines that are collectively connected for a
      particular file.
        • For example, if you start a BitTorrent client and it tells you that you're
          connected to 10 peers and 3 seeds, then the swarm consists of you and
          those 13 other people.
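
The piece bookkeeping described above (fixed piece length, one SHA-1 hash per piece, and requesting pieces a node does not own) can be sketched as follows. This is an illustration only: it does not parse or write a real .torrent file, and the helper names are hypothetical.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # "usually 256 KB"


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Return the SHA-1 digest of each piece, as listed in the metainfo."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]


def verify_piece(index, piece, hashes):
    """Check a downloaded piece against the hash published for that index."""
    return hashlib.sha1(piece).digest() == hashes[index]


def pieces_to_request(have, neighbor_has):
    """Piece indices the neighbor can supply that this node does not own."""
    return sorted(neighbor_has - have)


content = b"x" * 10                            # tiny stand-in for a shared file
hashes = piece_hashes(content, piece_length=4)
print(len(hashes))                             # number of pieces
print(verify_piece(2, b"xx", hashes))          # last piece is shorter
print(pieces_to_request({0}, {0, 1, 2}))
```

Because every piece is verified independently, a node can accept pieces from any neighbor, in any order, without trusting that neighbor.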




                  Critical Elements
 • A BitTorrent tracker
   – The tracker maintains information about all BitTorrent
     clients utilizing each torrent.
   – The tracker identifies the network location of each client
     either uploading or downloading the P2P file associated with
     a torrent.
   – It also tracks which fragment(s) of that file each client
     possesses, to assist in efficient data sharing between clients.
       • i.e. the tracker keeps track of all peers downloading the file
   For example:
       • http://bt.cnxp.com:8080/announce
       • http://btfans.3322.org:6969/announce

          How a node enters a swarm for file “popeye.mp4”
 • File popeye.mp4.torrent hosted at a (well-known) webserver
 • The .torrent has the address of the tracker for the file
 • The tracker, which runs on a webserver as well, keeps
   track of all peers downloading the file
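
A peer contacts the tracker through an announce URL such as the examples above. The sketch below only builds such a request URL and does not send it. The query parameter names follow the common HTTP tracker convention; the helper name `build_announce_url`, the peer id, the port and the info dictionary bytes are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlencode


def build_announce_url(tracker_url: str, info_hash: bytes, peer_id: bytes,
                       port: int, left: int) -> str:
    """Return the URL a client would request to announce itself to a tracker."""
    query = urlencode({
        "info_hash": info_hash,   # 20-byte SHA-1 that identifies the torrent
        "peer_id": peer_id,       # 20-byte identifier chosen by the client
        "port": port,             # port on which this peer accepts connections
        "uploaded": 0,
        "downloaded": 0,
        "left": left,             # bytes this peer still has to download
        "event": "started",
    })
    return f"{tracker_url}?{query}"


# Hypothetical values; a real info_hash is derived from the .torrent contents.
info_hash = hashlib.sha1(b"hypothetical info dictionary").digest()
peer_id = b"-XX0001-000000000000"
url = build_announce_url("http://bt.cnxp.com:8080/announce",
                         info_hash, peer_id, port=6881, left=1024)
print(url)
```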




                    Critical Elements
• An end user (peer)
  – Users who want to use BitTorrent must install the
    corresponding client software or a web browser plug-in.
  – Downloader (leecher): a peer that has only a part (or none) of
    the file.
  – Seeder: a peer that has the complete file and chooses to stay
    in the system so that other peers can download from it.
  – BitTorrent clients connect to a tracker when attempting
    to work with torrent files.
     • The tracker tells the client where the P2P file can be found
       (normally on other, remote machines).

[Figure: How a node enters a swarm for file “popeye.mp4”, step 1; labels: Peer, www.bittorrent.com]




[Figure: How a node enters a swarm for file “popeye.mp4”, step 2; labels: Peer, www.bittorrent.com, Tracker]

        Three elements necessary for sharing a file with BitTorrent
 • The tracker - coordinates connections among the peers.
    – The tracker doesn't know anything about the actual contents of a file.
 • The web server - stores and serves the .torrent file.
 • At least one seeder
    – Holds the file's actual contents.
    – The seeder is almost always an end-user's desktop machine (peer), rather
      than a dedicated server machine.
    – Seeding is monitored by the tracker.
    – Seed your file for a long time to prevent peers from being left with
      incomplete files.
    – Generally, it's considered good manners to continue seeding a file after you
      have finished downloading, to help out others.
 • When you finish a download in BitTorrent, and you are only
   uploading, you're seeding!




[Figure: How a node enters a swarm for file “popeye.mp4”, step 3; labels: Peer, www.bittorrent.com, Tracker, Swarm]

                         File sharing
 Large files are broken into pieces of size between 64 KB and 1 MB.
 [Figure: a file divided into pieces numbered 1 to 8]




                       BT: publishing a file
[Figure labels: Bob, Harry Potter.torrent, Web Server, Tracker, Downloader A, Seeder B, Downloader C]

                       A trivial example
[Figure: Seeder John holds pieces {1,2,3,4,5,6,7,8,9,10}. The piece set of one downloader grows
 {} → {1,2,3} → {1,2,3,5}; the piece set of the other grows {} → {1,2,3} → {1,2,3,4} → {1,2,3,4,5}.
 The two downloaders are labelled Joe and Fan Bin.]
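
The trivial example can be mimicked with a toy simulation in which every downloader fetches, in each round, one piece it does not yet own from the rest of the swarm. The lowest-numbered-piece selection rule and the round-based timing are simplifying assumptions made for this sketch, not the way a real client selects pieces.

```python
def run_round(swarm: dict[str, set[int]]) -> None:
    """Let every peer fetch one missing piece that some other peer already holds."""
    # Work from a snapshot so pieces obtained in this round are shared only in the next one.
    snapshot = {name: set(pieces) for name, pieces in swarm.items()}
    for name, pieces in swarm.items():
        available = set().union(*(p for other, p in snapshot.items() if other != name))
        wanted = available - pieces
        if wanted:
            pieces.add(min(wanted))  # assumption: always take the lowest-numbered piece


# Seeder John holds all ten pieces; the two downloaders start with nothing.
swarm = {"John": set(range(1, 11)), "Joe": set(), "Fan Bin": set()}
rounds = 0
while any(pieces != swarm["John"] for pieces in swarm.values()):
    run_round(swarm)
    rounds += 1
print(rounds, "rounds until every peer holds the complete file")
```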




         P2P Technical Challenges
   •   Routing protocols
   •   Network topologies
   •   Peer discovery
   •   Communication/coordination protocols
   •   Quality of service
   •   Security

         P2P SECURITY
            Security is the condition of being protected
                       against danger or loss.

                    P2P Security
• P2P file sharing networks are constantly under
  attack.
• P2P is potentially more vulnerable than client-server.
  – Decentralized
  – More difficult to manage and control
• Need to understand the security issues for
  architecting future P2P apps

                Types of P2P Attacks
 • Poisoning: a client can provide content that doesn't
   match the description.
   – A client A can broadcast a message saying it needs file
     'X'. A malicious client can send a message back to A
     saying it has file X, then send it file Y.
 • Denial-of-service attacks, which reduce or halt
   network activity altogether.
 • Defection attacks, which allow a client to participate
   on the network with a very low upload-to-download
   ratio.

                Types of P2P Attacks (cont'd)
 • Virus attacks, where a malicious client can add
   viruses into files shared on the network.
 • Malware attacks, where the P2P software
   contains spyware.
 • Filtering attacks, where network operators may
   attempt to prevent P2P network data from being
   carried.

                Attacks On & From
 • Attacks on P2P systems:
 • Attacks from P2P systems:




                 Attacks on P2P sharing
   Two types:
   • Pollution: file corruption (targets the file content)
   • Index poisoning (targets the file index)

                     File Pollution
[Figure sequence; labels: original content, polluted content, pollution company, pollution server (several),
 file sharing network, Alice, Bob. Caption: "Unsuspecting users spread pollution!"]

                   INDEX POISONING
 • The aim of the attacker is to make several
   peers believe that some popular file is
   present at the victim.
 • The attacker sends a location publish
   message to every crawled peer.
 • In this message, the attacker includes the
   victim's IP address and port number.
 • The attacker puts the file hash of a popular
   file in the message.
 • A peer B that receives the message adds this
   file hash to its index, along with the
   location of the victim.
 • When a peer C searches for that file, it
   may be told by some poisoned peer that the
   victim has the file.
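
Index poisoning works because a naive index records whatever location a publish message claims. The toy index below shows the effect. The titles and addresses are taken from the index shown in the Index Poisoning figures; the port number and the function names `publish` and `lookup` are assumptions for this sketch.

```python
from collections import defaultdict

# Toy index kept by one peer: title (or file hash) -> set of advertised (ip, port) locations.
index: dict[str, set[tuple[str, int]]] = defaultdict(set)


def publish(title: str, ip: str, port: int) -> None:
    """Naive handler: stores the claimed location without checking who sent the message."""
    index[title].add((ip, port))


def lookup(title: str) -> list[tuple[str, int]]:
    """Return every location advertised for a title."""
    return sorted(index.get(title, ()))


PORT = 6346  # hypothetical port number

# Honest entries.
publish("bigparty", "123.12.7.98", PORT)
publish("smallfun", "23.123.78.6", PORT)
publish("heyhey", "234.8.89.20", PORT)

# Poisoned entry: the attacker advertises a popular title at the victim's address.
publish("bighit", "111.22.22.22", PORT)

print(lookup("bighit"))  # a searching peer is now directed to the victim
```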




                    Index Poisoning
[Figure: file sharing network with peers 123.12.7.98, 23.123.78.6 and 234.8.89.20, and the index below]

                   index
          title       location
          bigparty    123.12.7.98
          smallfun    23.123.78.6
          heyhey      234.8.89.20

                    Index Poisoning
[Figure: the same network with an additional node 111.22.22.22 and an additional index entry]

                   index
          title       location
          bigparty    123.12.7.98
          smallfun    23.123.78.6
          heyhey      234.8.89.20
          bighit      111.22.22.22

               ROUTING TABLE POISONING
• The aim of the attacker is to make the peers add the victim
  as their neighbor.
• The attacker sends node announcement messages to
  every crawled peer.
• The attacker includes the victim's IP address and port number
  in these messages.
• The peers add the victim as their neighbor.
• Query messages are forwarded to the victim.

                      Free Riding
 • Peers share little or no data in P2P file-sharing
   systems
 • Measurement
     – Nearly 70% of Gnutella users share no files
     – Nearly 50% of all responses are returned by the
       top 1% of sharing hosts
 • Incentive mechanisms to encourage user
   cooperation

                       P2P Worms
 • Topological Scan Worms
 • Passive Worms
 A computer worm is a self-replicating malware computer program.
 It uses a computer network to send copies of itself to other nodes.
 It may do so without any user intervention.

               TOPOLOGICAL WORM ATTACK
[Figure]




               TOPOLOGICAL WORM ATTACK
[Figure]

                     PASSIVE P2P WORMS
• Vulnerability in the protocol
• Wait for the vulnerable targets to contact them
• Case 1
  – The worm creates infected copies of itself with attractive filenames and
    places them in the shared folder of the P2P client, or replaces the files
    present in the shared folder with itself
  – e.g. VBS.Gnutella, Benjamin Worm etc.
• Case 2
  – The worm answers positively to a proportion of search queries by changing the
    name of the corrupted file to match the search query
  – e.g. Gnuman

               P2P-Worm.Win32.Benjamin.a
 • P2P-Worm.Win32.Benjamin.a (Kaspersky Lab) is also
   known as: Worm.P2P.Benjamin.a (Kaspersky Lab),
   W32/Benjamin.worm (McAfee),
   W32.Benjamin.Worm (Symantec),
   Win32.HLLW.Benjamin (Doctor Web)
 • This worm uses the Kazaa file exchange P2P network
   to spread itself.
 • Benjamin is written in Borland Delphi and is
   approximately 216 KB in size; it is compressed with the
   ASPack utility.

                        Effects
 • Eating up free disk space
 • Benjamin opens a Web page called
   benjamin.xww.de to display banner ads.
   – One morning the Benjamin.xww.de Web site
     had a message saying: "Domain closed due to
     massive abuse."

             How vulnerable is BitTorrent?

                    Pollution Attack
 • 1. The peers receive the peer list from the tracker.




             Pollution Attack                    DDoS Attack
• 2. One peer                   • DDoS = Distributed denial of service
  contacts the                  • Based on the fact that the BitTorrent tracker has
  attacker for a                  no mechanism for validating peers.
  chunk of the file.            • Uses modified client software




             Pollution Attack                    DDoS Attack
• 3. The attacker sends         • 1. The attacker
  back a false                    downloads a large
  chunk.                          number of torrent
• This false chunk                files from a web
  will fail its hash              server.
  check and will be
  discarded.
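
The reason a false chunk is discarded is the per-piece SHA-1 hash stored in the .torrent file. A minimal sketch of that check follows; the chunk contents and the name `accept_piece` are assumptions made for illustration.

```python
import hashlib


def accept_piece(piece: bytes, expected_sha1: bytes) -> bool:
    """Keep a received chunk only if its SHA-1 digest matches the one in the .torrent file."""
    return hashlib.sha1(piece).digest() == expected_sha1


# Hypothetical chunk contents; the expected digest would come from the .torrent file.
genuine = b"genuine chunk data"
expected = hashlib.sha1(genuine).digest()

received = {}
for index, chunk in [(0, b"false chunk from the attacker"), (0, genuine)]:
    if accept_piece(chunk, expected):
        received[index] = chunk  # chunk verified, keep it
    # otherwise the chunk is discarded and has to be requested again

print(received)
```

The false chunk never corrupts the file, but the download time and bandwidth spent on it are lost, which is the point of the attack.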




             Pollution Attack                    DDoS Attack
• 4. The attacker               • 2. The attacker parses
  requests all chunks             the torrent files with a
  from the swarm and              modified BitTorrent
  wastes the peers'               client and, as he
  upload bandwidth.               announces that he is
                                  joining the swarm,
                                  substitutes the victim's
                                  IP address and port
                                  number for his own.
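
The DDoS attack depends on the tracker accepting whatever address a client announces. The toy classes below contrast a tracker without validation with one that compares the announced address against the address the request actually came from. Both classes, their method names and the addresses (taken from the ranges reserved for documentation) are assumptions for this sketch; the slides do not describe an actual tracker implementation.

```python
class NaiveTracker:
    """Records the address a client claims, without any validation."""

    def __init__(self) -> None:
        self.peers: set[tuple[str, int]] = set()

    def announce(self, claimed_ip: str, claimed_port: int, source_ip: str) -> bool:
        self.peers.add((claimed_ip, claimed_port))
        return True


class ValidatingTracker(NaiveTracker):
    """Rejects announces whose claimed address differs from the request's source address."""

    def announce(self, claimed_ip: str, claimed_port: int, source_ip: str) -> bool:
        if claimed_ip != source_ip:
            return False  # possible spoofing: do not list this address
        return super().announce(claimed_ip, claimed_port, source_ip)


ATTACKER_IP = "203.0.113.5"   # hypothetical attacker address
VICTIM_IP = "198.51.100.7"    # hypothetical victim address

naive, validating = NaiveTracker(), ValidatingTracker()
for tracker in (naive, validating):
    # The attacker announces the victim's address as if it were his own.
    tracker.announce(VICTIM_IP, 80, source_ip=ATTACKER_IP)

print(naive.peers)       # the victim is now listed as a peer
print(validating.peers)  # the spoofed announce was rejected
```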




                              DDoS Attack
• 3. As the tracker receives requests for a list of participating
  peers from other clients, it sends the victim's IP address and port
  number.

               Current Solutions: Pollution Attacks
 • Blacklisting
    – Achieved using software such as PeerGuardian or MoBlock.
    – Blocks connections from blacklisted IP addresses, which are
      downloaded from an online database.
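
Blacklisting amounts to checking every incoming connection against a list of blocked addresses. A minimal sketch using Python's standard `ipaddress` module follows; the blocked ranges are placeholder documentation addresses, not entries from any real blacklist database, and the sketch does not reproduce how PeerGuardian or MoBlock work internally.

```python
import ipaddress

# Hypothetical blacklist entries; real tools download such lists from an online database.
BLACKLIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.42/32"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blacklisted range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLACKLIST)


for candidate in ("203.0.113.9", "198.51.100.42", "192.0.2.1"):
    print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```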




                              DDoS Attack
 • 4. The peers then attempt to connect to the victim to try
   to download a chunk of the file.

                Solutions – TRUST and REPUTATION
 • Most of the solutions proposed to solve the problem of attacks are
   based on building trust (and/or reputation) between
   the peers
 • Some of the popular approaches are:
    – DCRS (BitTorrent)
    – EigenTrust
    – XRep
 • These approaches do slow down the attack




                         Attack illustration
[Figure labels: attacker, Discussion forum, several .torrent files, Tracker, clients, victim;
 messages: "Who has the files?" and "Victim has the files!"]

               What is Trust? What is reputation?
 • Trust – a peer's belief in another peer's capabilities, honesty
   and reliability based on its own experiences.
 • Reputation – a peer's belief in another peer's capabilities,
   honesty and reliability based on recommendations received
   from other peers.
   – Reputation can be centralized, computed by a third party, or it can
     be decentralized, computed independently by each peer after
     asking other peers for recommendations.




                 What is Trust? (cont'd)
• Both trust and reputation are used to evaluate a peer's
  trustworthiness.
• Trust and reputation increase or decrease with further
  experience.
• Trust and reputation both depend on some context.
• For example:
  – Mike trusts John as his doctor, but he doesn't trust John as a
    mechanic who can fix his car.
      • In the context of seeing a doctor, John is trustworthy.
      • In the context of fixing a car, John is untrustworthy.

        An Example Trust Management System (BitTorrent)
• Debit-Credit Reputation System (DCRS)
• Each client calculates a local trust score for its peers, based on
  valid pieces uploaded/downloaded
• The tracker combines these individual scores to make a global score




               What is Trust Management?
  • The term “Trust Management” was first coined by Blaze
    et al. (1996)
    – a coherent framework for the study of security
      policies, security credentials and trust
      relationships.

                   DCRS (cont'd)
 Local Trust Score Computation
   Fij = Uij - Dij, where
       Uij – the number of chunks that i uploaded to j
       Dij – the number of chunks that i downloaded from j
 Using Fij, the local trust score LTij is computed as
       LTij = -1 if a bogus chunk is uploaded by peer j
       LTij =  0 if Fij > t
       LTij =  1 if Fij <= t, where 't' is the fairness threshold
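
The local trust score rule can be written directly as a small function. This is a sketch of the rule as stated on the slide; the function and parameter names are assumptions, and the sample numbers are made up.

```python
def local_trust(uploaded_to_j: int, downloaded_from_j: int,
                threshold: int, bogus_chunk: bool = False) -> int:
    """Local trust score LTij that peer i assigns to peer j.

    uploaded_to_j     -- Uij, chunks that i uploaded to j
    downloaded_from_j -- Dij, chunks that i downloaded from j
    threshold         -- t, the fairness threshold
    bogus_chunk       -- True if j uploaded a bogus chunk to i
    """
    if bogus_chunk:
        return -1
    fairness = uploaded_to_j - downloaded_from_j  # Fij = Uij - Dij
    return 0 if fairness > threshold else 1


# Hypothetical numbers with fairness threshold t = 5.
print(local_trust(10, 2, threshold=5))                   # Fij = 8 > t
print(local_trust(4, 3, threshold=5))                    # Fij = 1 <= t
print(local_trust(0, 9, threshold=5, bogus_chunk=True))  # bogus chunk received
```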




               Reputation Management
• Need for trust mechanisms
  – To assess the trustworthiness of peers and the content
        • Malicious peers generate an unlimited number of inauthentic
          files
  – To deter malicious behavior
• Reputation is an assumption that past behavior is
  indicative of future behavior
• Use of reputation to build trust

                   DCRS (cont'd)
 Global Trust Score Computation
 • A global trust score represents the opinion that the rest of the
   swarm has of a peer.
 • At regular intervals the tracker receives the local trust
   scores of peers in the swarm.
 • The tracker chooses k random local trust scores for peer j,
   where k is less than the number of peers in the swarm.
 • The tracker uses these k local trust scores for peer j and sets their
   average as the global trust score for j
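
The global trust computation can be sketched as sampling k of the reported local scores and averaging them. The function name `global_trust`, the sample scores and the seeded random generator are assumptions for this example; how the tracker collects the scores is not shown.

```python
import random


def global_trust(local_scores: list[int], k: int, rng: random.Random) -> float:
    """Average of k randomly chosen local trust scores reported about one peer."""
    if not 0 < k <= len(local_scores):
        raise ValueError("k must be between 1 and the number of reported scores")
    sample = rng.sample(local_scores, k)
    return sum(sample) / k


# Hypothetical local trust scores that five peers reported about peer j.
scores_about_j = [1, 1, 0, -1, 1]
rng = random.Random(0)  # seeded only to make the sketch repeatable
score = global_trust(scores_about_j, k=3, rng=rng)
print(score)
```

Because each local score is -1, 0 or 1, the global score always lies between -1 and 1.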




                   DCRS (cont'd)
  • Global trust managed by the tracker prevents clients
    from being dishonest.
  • Solves the issue of pollution attacks by ignoring
    untrustworthy peers
    – Trust systems are more flexible than blacklisting
      because peers can earn back their trust through good
      behavior.
  • Prevents DDoS attacks because the victim will earn a low
    trust score and be ignored.

 • P2P systems already store a huge amount of widely
   varying data collected from different sources.
 • If this data, distributed over a large number of peers, can
   be integrated,
     – it represents a very valuable data repository that, upon
       mining, may give very exciting and useful results.
 • Peer-to-peer K-means Algorithm
     – K-means clustering partitions a collection of data
       tuples into K disjoint, exhaustive groups (clusters),
       where K is a user-specified parameter.




                                                                                                                        Example: Topic-wise document clustering in a P2P
                                                                                                                                      document repository
                                                                                                                      • Documents stored in different
                                                                                                                        peers are clustered based on
                                                                                                                        three subjects
                                                                                                                         – movies
                                                                                                                         – baseball
                                                                                                                         – hurricane
       Other Emerging P2P Applications                                                                                  by exchanging information
                                                                                                                         with other peers.
                                                                                                                      • In a P2P clustering,
                                                                                                                         – some peers may not be present in
                                                                                                                           the network all the time, and
                                                                                                                           may join or leave the network
                                                                                                                           while the clustering is in
                                                                                                                           progress.
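
The following is a minimal sketch of K-means over data that stays on the peers, under simplifying assumptions that are not from the slides: every online peer's per-cluster sums and counts are merged in each round (a real P2P algorithm would exchange them only with neighbours), and peer churn is modelled by a hypothetical `online_schedule` listing which peers are present in each round. All names are illustrative.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


def local_stats(points, centroids):
    """Per-cluster coordinate sums and counts for one peer's local data."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        c = nearest(pt, centroids)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += pt[d]
    return sums, counts


def p2p_kmeans(peers, centroids, online_schedule):
    """Run one K-means round per entry of online_schedule.

    Only sums and counts leave a peer; the raw points never move.
    Peers that are offline in a round simply contribute nothing.
    """
    dim = len(centroids[0])
    for online in online_schedule:
        total_sums = [[0.0] * dim for _ in centroids]
        total_counts = [0] * len(centroids)
        for name in sorted(online):
            sums, counts = local_stats(peers[name], centroids)
            for c in range(len(centroids)):
                total_counts[c] += counts[c]
                for d in range(dim):
                    total_sums[c][d] += sums[c][d]
        # Keep the old centroid if no online peer had points for the cluster.
        centroids = [
            [s / total_counts[c] for s in total_sums[c]]
            if total_counts[c] else list(centroids[c])
            for c in range(len(centroids))
        ]
    return centroids


# Hypothetical 1-D data held by three peers; peer C is offline in round 2.
peers = {
    "A": [(0.0,), (2.0,)],
    "B": [(10.0,), (12.0,)],
    "C": [(1.0,), (11.0,)],
}
schedule = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}]
result = p2p_kmeans(peers, [[0.0], [10.0]], schedule)
print(result)
```

Exchanging only sums and counts is what lets the peers approximate the centralized result without shipping their documents anywhere.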




        Distributed Data Mining in P2P Networks
• Data mining is the extraction of hidden predictive
  information from large databases.
• Most off-the-shelf data mining systems are designed to
  work as a monolithic centralized application.
• Distributed data mining (DDM) deals with the problem of
  data analysis in environments with distributed data,
  computing nodes, and users.
• P2P networks are well-suited to distributed data mining
  (DDM).
• A primary goal of P2P data mining is to achieve the same
  (or nearly the same) data mining result as a centralized approach,
  without moving any data from its original location.
Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, Hillol Kargupta, “Distributed Data Mining in Peer-to-Peer
Networks”


        Cloud computing
• Cloud computing is a computing paradigm shift where
  computing is moved away from personal computers or
  an individual server to a “cloud” of computers.
• Users of the cloud only need to be concerned with the
  computing service being asked for, as the underlying
  details of how it's achieved are hidden.
• This is done by pooling all computer resources together
  and managing them by software rather than by a human.
• Prominent players include Google (AppEngine),
  Microsoft (Azure), Amazon (EC2), Yahoo-Apache
  (Hadoop) and Cisco-EMC (Acadia).




Cloud Architecture
[Figure: cloud architecture layers.
  Users: Individuals, Corporations, Non-Commercial
  Cloud Middle Ware: Storage Provisioning, OS Provisioning, Network Provisioning,
    Service (apps) Provisioning, SLA (monitor), Security, Billing, Payment
  Resources: Services, Storage, Network, OS]


So What’s the Issue?
• These super server-warehouses are expected to consume
  around 300 MegaWatts (MW) of electricity a month.
• Existing large data-centres consume anywhere from 20 to
  50 MW of electricity (enough to power 40,000 homes).
  – Hence, the energy consumption of the million-server warehouse
    raises serious concerns about its environmental sustainability.
• The environmental impact of cloud computing has not
  received the desired attention of the research community
  and needs to be addressed.
  – It is estimated that Google's data centers alone consume over
    1.5% of the electricity produced world-wide.
  – The larger data-centers in the US are estimated to consume 25.6
    GWh of electricity per year and produce 17006 tonnes of CO2
    emissions.




               The Scale of the Cloud

[Figure] Google's million-server warehouse (Oregon, USA). Each building is approximately
the size of 2 football fields. Source: IEEE Spectrum, Feb. 2009


               The Peer Enterprises Framework




                  The Scale of the Cloud

[Figure labels: Cooling Towers; each container houses 2500 servers]
Schematic of the million-server warehouse. The largest most complex data-center in
the world. Each container has built-in networking, cooling and storage bundled
together. Source: IEEE Spectrum, Feb. 2009


                  PE vs Cloud

Parameter: Cloud Computing compared with Peer Enterprises (PE)

1. Cost
   Cloud Computing:  Expensive to provision. Involves creation of internet-scale
                     data-centres costing hundreds of millions of dollars.
   Peer Enterprises: Uses already provisioned compute infrastructure. Hence, no
                     new costs need to be incurred.

2. Energy Consumption
   Cloud Computing:  Around 300 MegaWatts per month.
   Peer Enterprises: Already provisioned and in use. No additional energy
                     consumption.

3. Environmental Impact
   Cloud Computing:  Yes. Very high.
   Peer Enterprises: Yes, but the PE concept does not place an additional load on
                     the environment.

4. Service Migration
   Cloud Computing:  Tedious, since vendor interoperability is as yet undefined.
   Peer Enterprises: Easy. Organizations can enter into new contracts and
                     terminate existing contracts.

5. Degree of Decentralization
   Cloud Computing:  Fairly centralized control. All applications run in a single
                     data-centre. Centralized elements for load-balancing,
                     scalability etc.
   Peer Enterprises: Based on the decentralized P2P concept. No centralized
                     elements/control. Organizations are free to negotiate
                     resource sharing contracts as required.

6. Data Lock-In
   Cloud Computing:  Yes. Once all data resides with a single vendor, any
                     connectivity faults can render the data irretrievable.
   Peer Enterprises: Organizations can enter into multiple contracts to avoid
                     data lock-in by creating requisite redundancy.

7. Performance
   Cloud Computing:  Expected to be very performant, since dedicated compute
                     infrastructure is involved.
   Peer Enterprises: Less performant due to frequent node transience. Performance
                     enhancements and optimizations need to be devised.

8. Vendor Dependence
   Cloud Computing:  High. Compute infrastructure from only one provider can be
                     used.
   Peer Enterprises: None. As many service providers as needed can be used by
                     entering into contracts.




Problems/challenges for ad hoc networks
  • Problems are due to
    – No central entity available for organization
    – Limited range of wireless communication
    – Mobility of participants
    – Battery-operated entities
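
To illustrate how limited radio range and mobility interact, the sketch below derives the set of links purely from node positions, so that moving one node changes the topology. The node names, coordinates, and the single shared `radio_range` are assumptions made for the example.

```python
from math import dist


def links(positions, radio_range):
    """Return the node pairs that are within radio range of each other."""
    names = sorted(positions)
    return {
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if dist(positions[a], positions[b]) <= radio_range
    }


# Hypothetical node positions in metres.
positions = {"n1": (0, 0), "n2": (3, 4), "n3": (10, 0)}
before = links(positions, radio_range=6)

# n3 moves closer to n2; no central entity announces the change,
# each node just sees a new neighbour appear.
positions["n3"] = (6, 4)
after = links(positions, radio_range=6)

print(before)
print(after)
```

Because the link set is a function of position alone, every movement forces the nodes to rediscover routes, which in turn costs battery power.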




 WWW + Mobile Telephony = Mobile Access to Information
[Chart: Mobile Telephone Users and Internet Users, 1993 to 2001; vertical axis 0 to 700]


 Mobile P2P?
  • Transferring data from one mobile phone to another
  • The mobile phone and the network limit the possibilities of mobile P2P
    – Limited processing capacity (CPU and memory)
    – Low bandwidth
    – Limited power, since the phone runs on a battery
    – Billing

  Much more challenging than traditional P2P




       MANET: Mobile Ad hoc Networks
Ad hoc networks are wireless, self-organizing systems that
 provide functionality without infrastructure support.
Ad hoc means that there are no central servers.
 Content is distributed to several nodes instead of one server.
MANET: a collection of wireless mobile nodes dynamically forming
a network without any existing infrastructure, in which the nodes' relative
positions dictate the (dynamically changing) communication links.


       Full mobile P2P in 2G/2.5G
  • In 2G/2.5G there are limitations that are impossible to overcome:
    – Operators do not expose a mobile phone's IP address
    – Operators control data traffic
    – The network does not offer any way to sustain an active connection
      in all situations
    – Voice and data cannot be transferred simultaneously




A solution to 2G/2.5G P2P: MMS
  • MMS could be used as a way of sending data
    from one mobile node to another.
  • However, there are problems:
    – How to know who has the information you need?
    – MMS size is limited
    – MMS costs more than GPRS data


Computer aided P2P: short distance
  • Over a short distance we would not have true mobile
    P2P.
  • A better solution would be to control a fixed network
    peer remotely.




           A solution to 2G/2.5G P2P: MMS
  • We have to have a server that keeps a
    record of the MSISDN (or IMSI) number and
    the data that can be found from that
    number.
  • The downloader requests the data, and the
    person being downloaded from permits or
    denies the download.

Mobile Station International Subscriber Directory Number (MSISDN) is a number used to
identify a mobile phone number internationally.
MSISDN = CC + NDC + SN
CC = Country Code
NDC = National Destination Code
SN = Subscriber Number

The IMSI, by contrast, identifies the subscriber (SIM) and is built as MCC + MNC + MSIN.
Example IMSI: 429 01 1234567890
  MCC   429          Nepal
  MNC   01           Nepal Telecom
  MSIN  1234567890


           Computer aided mobile P2P: remotely
  • For example, over HTTP we could control the fixed network peer
    by using a program called mobile eMule.
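
As a small illustration of MSISDN = CC + NDC + SN, the sketch below splits a number into its three parts. The field lengths vary by country and operator, so they are passed in as parameters; the function name and the sample number are made up for the example (977 is Nepal's country calling code, the remaining digits are invented).

```python
def split_msisdn(msisdn, cc_len, ndc_len):
    """Split an MSISDN into CC, NDC and SN, given the two prefix lengths."""
    digits = msisdn.lstrip("+").replace(" ", "")
    if not digits.isdigit():
        raise ValueError("MSISDN must contain only digits")
    if len(digits) > 15:
        # E.164 numbers are at most 15 digits long.
        raise ValueError("MSISDN longer than 15 digits")
    if len(digits) <= cc_len + ndc_len:
        raise ValueError("no digits left for the subscriber number")
    return {
        "CC": digits[:cc_len],
        "NDC": digits[cc_len:cc_len + ndc_len],
        "SN": digits[cc_len + ndc_len:],
    }


# Hypothetical number: country code 977, a 2-digit NDC, then the SN.
parts = split_msisdn("+977 98 41234567", cc_len=3, ndc_len=2)
print(parts)
```

A directory server like the one described above could use the full digit string as the key under which it records what data each phone offers.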




          A better solution: computer aided P2P
   • All the major limitations could be overcome if the mobile
     phone were connected to a computer that runs P2P
     software.
   • We would only need software to communicate between
     the computer and the mobile phone:
      – Short distance: Infrared, Bluetooth etc.
      – Remotely: Over HTTP


          eMule
   • eMule is a free peer-to-peer file
     sharing application for
     Microsoft Windows.
   • The name "eMule" comes from
     the mule, an animal that
     is somewhat similar to a donkey.




3G
  • Delivers speeds up to 14.4 Mbit/s on the downlink and
    5.8 Mbit/s on the uplink.
  • Consumers will be charged for the quantity of data
    they transmit, not for how much time they are
    connected to the network.
  • With 3G you are constantly online and basically pay
    for the information you receive.
  • Because third-generation packet-based networks
    allow users to be online all the time, the potential for
    new applications is huge.




                   eMule: how it works
• Each file that is shared using eMule is hashed as a
  hash list using the MD4 algorithm.
• The MD4 hash, file size, filename, and search
  attributes are stored on eD2k servers.
• Users can search for filenames in the servers.
• Users are presented with the filenames and the unique
  identifier consisting of the MD4 hash for the file and
  the file's size, which can be added to their downloads.
• The client then asks the servers which other
  clients share the file with that hash.
• The servers return a set of IP addresses that indicate the
  locations of the clients that share the file.
• eMule then asks the peers for the file.


                   Threats to mobile P2P
• In 3G true mobile P2P is possible due to high bandwidth, efficient mobile phones and
  simultaneous voice and data capability
   – But will the operators allow P2P software, since it would lead to a loss of revenue?
   – In the 3G network architecture, every data connection of a mobile terminal is routed
     through the operator's network. This makes it possible for the operator to fully control
     the traffic of mobile terminals.
      • For example, a network operator has the power to allow or prevent terminal-to-
        terminal connections in its network.
      • P2P protocols demand direct connections between the peers because their key idea
        is that the peers communicate directly with each other without any central server.
         – Lack of terminal-to-terminal connections would make it impossible for true P2P to
           exist.
• Data transfer fees are currently quite high, which reduces the willingness of users to share data in
  MP2P networks.
• Use of MP2P applications may reduce the operators' opportunities to sell their own
  services.
• Viruses, spyware, etc.
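
The lookup flow described in the eMule steps can be sketched with a toy in-memory index. Several things here are assumptions made so the example runs anywhere: SHA-256 stands in for MD4 (which many Python builds no longer provide), the chunk size is a tiny demo value rather than the one eD2k uses, and the `IndexServer` class and its method names are invented for illustration rather than taken from the eD2k protocol.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4  # demo value only; real clients hash far larger chunks


def hash_list(data, chunk_size=CHUNK_SIZE):
    """Hash each chunk of the file separately (SHA-256 standing in for MD4)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [hashlib.sha256(c).hexdigest() for c in chunks or [b""]]


def file_id(data, chunk_size=CHUNK_SIZE):
    """Identifier = (hash over the hash list, file size)."""
    joined = "".join(hash_list(data, chunk_size)).encode()
    return hashlib.sha256(joined).hexdigest(), len(data)


class IndexServer:
    """Toy stand-in for a server that indexes names, ids and sources."""

    def __init__(self):
        self.names = defaultdict(set)    # filename -> file ids
        self.sources = defaultdict(set)  # file id -> client addresses

    def publish(self, address, filename, data):
        fid = file_id(data)
        self.names[filename].add(fid)
        self.sources[fid].add(address)
        return fid

    def search(self, term):
        """Return (filename, file id) pairs whose name contains term."""
        return {
            (name, fid)
            for name, fids in self.names.items()
            if term.lower() in name.lower()
            for fid in fids
        }

    def locate(self, fid):
        """Return the addresses of the clients sharing this file id."""
        return set(self.sources.get(fid, set()))


server = IndexServer()
content = b"same bytes, two names"
id_a = server.publish("10.0.0.1", "holiday.avi", content)
id_b = server.publish("10.0.0.2", "Holiday-copy.avi", content)

hits = server.search("holiday")
peers_with_file = server.locate(id_a)
print(sorted(name for name, _ in hits), sorted(peers_with_file))
```

Identifying a file by content hash plus size, rather than by name, is what lets the server group renamed copies as sources of the same download.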




          Computer aided mobile P2P: eMule
[Figure steps: 1. login   2. search   3. download to peer   4. download to phone]
• eMule is a working solution
• eMule has a large user base, currently averaging 3 to 5 million users


          P2P Based Software Engineering
• With the rapid development of network technologies,
  software development is becoming more and more
  complicated.
• Traditional SE management methods based on the C/S
  structure have not coped well with large scale
  software development.
• Zhao et al. propose an SE management method based on P2P that
   – Overcomes the server bottlenecks that exist in C/S
   – Takes full advantage of computation resources

Lina Zhao, Yin Zhang, Sanyuan Zhang, and Xiuzi Ye, “P2P-Based Software
Engineering Management”




Future
•   Semantic P2P
•   Cloud Computing
•   Data Mining
•   P2P Based Software Engineering
•   Audio/Video Streaming
•   Security – autonomic computing
•   Collaborative learning
•   Mobile P2P
•   Emergency First Response





Research Issues in P2P Netwroks

  • 1.
    How to RunApplications Faster ? Research Issues in P2P • There are 3 ways to improve performance: – Work Harder Computing – Work Smarter – Get Help • Computer Analogy – faster hardware high performance processors or peripheral devices – Optimized algorithms and techniques used to solve computational tasks – Multiple computers to solve a particular task Distributed…. OUTLINE • When a handful of powerful computers are • Centralized Vs. Distributed linked together and communicate with each • What is P2P? other • P2P Architectures – the overall computing power available can be • P2P and Applications amazingly vast. • Search and Replication Techniques – Such a system can have a higher performance share • P2P Security than a single supercomputer. • Emerging P2P Applications – The objective of such systems is to minimize • Conclusion communication and computation cost. Centralized? • Computation in networks of processing • Distributed system is an application that executes a nodes can be classified into centralized or distributed computations. collection of protocols to coordinate the actions of • A centralized solution relies on one node multiple processes on a communication network, being designated as the computer node that processes the entire application such that all components cooperate together to locally perform a single or small set of related tasks. • The central system is shared by all the users all the time. • There is single point of control and single point of failure. 1
  • 2.
    Examples of DistributedSystems • The Internet • The collaborating computers can access remote – Heterogeneous resources as well as local resources in the network of computers distributed system via the communication network. and applications • The existence of multiple autonomous computers is – Implemented through the Internet transparent to the user in a distributed system. Protocol Stack – The user is not aware that the jobs are executed by multiple computers subsist in remote locations. – A centralized algorithm is at the heart of a single computer. – A distributed algorithm is at the heart of a society of computers Computer Networks vs. Distributed Systems Distributed…. • Distributed systems are built up on top of existing networking and operating systems software. • Computer Network: the autonomous computers are • The Middleware enables computers to coordinate their activities explicitly visible and to share the resources of the system – Middleware is the bridge that connects distributed applications across • Distributed System: existence of multiple dissimilar physical locations, with dissimilar hardware platforms, network autonomous computers is transparent technologies, operating systems, and programming languages. • Middleware provides standard services such as naming, concurrency • Many problems in common control, event distribution, authorization to specify access rights to resources, security etc. • Normally, every distributed system relies on services provided by a computer network. 2
  • 3.
    Computing Platforms Evolution:Breaking Administrative Barriers Foster-Kesselman • The Foster-Kesselman duo organized in Ian Foster 1997, at Argonne National Laboratory, Mathematics and Computer a workshop entitled “Building a Science Division Computational Grid”. 2100 2100 2100 2100 Argonne National Laboratory Argonne, IL 60439 P ? • At this moment the term “Grid” was E born. R • The workshop was followed in 1998 by 2100 2100 2100 2100 F 2100 Administrative Barriers O R the publication of the book “The Grid: M Individual Group Blueprint for a New Computing A N Department Infrastructure” by Foster and C Campus Kesselman themselves. Carl Kesselman E State Information Sciences Institute National • For these reasons they are not only to University of Southern Globe Inter Planet be considered the fathers of the Grid California Universe but their book, which in the meantime Marina del Rey, CA 90292 was almost entirely rewritten and re- published in 2003, is also considered the Desktop SMPs or Local Enterprise Global Inter Planet “Grid bible”. (Single Processor) SuperCom Cluster Cluster/Grid puters Cluster/Grid Cluster/Grid ?? The Need for Collaboration? Electric Grid and Grid Computing • Computing grids are conceptually not unlike • The worldwide business demands intense electrical grids. problem-solving capabilities for incredibly • Electric power grid - a variety of resources contribute power into a shared "pool" for many complex problems consumers to access on an as-needed basis. – the need for dynamic collaboration of many – In an electrical grid, wall outlets allows us to link to an infrastructure of resources that computing resources to be able to work together. generate, distribute, and bill for electricity. • This is a difficult challenge across all the technical – When you connect to the electrical grid, you don‟t need to know where the power plant is communities to achieve this level of resource or how the current gets to you. 
collaboration within the bounds of the necessary • Grid computing uses middleware to coordinate disparate IT resources across a network, quality requirements of the end user. allowing them to function as a virtual whole. – The goal of a computing grid, like that of the electrical grid, is to provide users with access to the resources they need, when they need them. Why Grids ? Large Scale Exploration needs them Solving technology problems using computer modeling, simulation and analysis Geographic Information Systems Life Sciences Aerospace CAD/CAM Military Applications 3
  • 4.
    CERN’s Large HadronCollider Client-Server Model 1800 Physicists, 150 Institutes, 32 Countries The most widely used Client invocatio n Server invocatio n result result Server Client 100 PB of data by 2010; 50,000 CPUs? Key: Process: Computer: Source The Large Hadron Collider (LHC) Router A gigantic scientific instrument near Geneva It is a particle accelerator used by physicists to study the smallest known “Interested” particles – the fundamental building blocks of all things. End-host Client-Server Why P2P? Source Router “Interested” End-host 4
  • 5.
    Client-Server Why P2P? Overloaded! Personal Computers 80% idle CPU time Internet Laptop 90% idle CPU time Source Computers in our Lab Router 99% idle CPU time ! Hot Spots become hotter “Interested” End-host What is driving P2P? Problem with Client-Server Model • Clients are not so dumb. – Scalability • Billions of Mhz CPU, tons of terabytes • As the number of users increases, there is a higher disk, millions of gigabits network demand for computing power, storage space, and bandwidth, … bandwidth associated with the server-side – Reliability – Unused resources. • The whole network will depend on the highly loaded server to function properly Computer System Taxonomy P2P – An overlay network • P2P overlay network C Computer Systems – The connected nodes E construct a virtual overlay Centralized Systems network on top of the F Distributed Systems (mainframes, SMPs) underlying network infrastructure B Client - server C Peer- to- Peer – Peer-to-peer network E topology is a virtual overlay A at application layer F G B D 30 5
  • 6.
    Typical Characteristics
    • Large scale: lots of nodes (up to millions)
    • Dynamicity: frequent joins, leaves, failures
    • Little or no infrastructure
    – No central server
    • Symmetry: all nodes are "peers" and have the same role
    [Figure: client/server model, where clients reach the servers across the Internet through a proxy cache, compared with the peer-to-peer model, where every node is a client/server; a congestion zone is marked in each]
    What is it...
    • P2P computing is the sharing of computer resources and services by direct exchange between systems.
    • These resources and services include the exchange of information, processing cycles, cache storage, and disk storage for files.
    • P2P computing takes advantage of existing desktop computing power and networking connectivity,
    – allowing economical clients to leverage their collective power to benefit the entire enterprise.
    • In a P2P architecture, computers that have traditionally been used solely as clients communicate directly among themselves and can act as both clients and servers, assuming whatever role is most efficient for the network.
    • Each node (peer), called a servent, acts as both a SERVer and a cliENT
    [Figure: peers, each with a shared folder and neighbors, act as client and server; a peer searches among its peers and then retrieves the file]
    P2P Dominates Internet Traffic
    • P2P has dominated Internet traffic
    – In 2006, more than 60% of Internet traffic
    Some Statistics about P2P Systems
    • More than 200 million users registered with Skype, around 10 million online users (2007)
    • Around 4.7M hosts participate in SETI@home (2006)
    • BT accounts for 1/3 of Internet traffic (2007)
    • More than 200,000 simultaneous online users on PPLive (streaming video network) (2007)
    • More than 3,000,000 users downloaded PPStream (2008)
  • 7.
    P2P Applications
    • In Peer-to-Peer (P2P) computing, applications are segregated into three main categories:
    – distributed computing,
    – file sharing, and
    – collaborative applications
    • The three categories of P2P serve different purposes
    – Distributed computing applications typically require the decomposition of a larger problem into smaller parallel problems
    – File sharing applications require efficient search across wide area networks
    – Collaborative applications require update mechanisms to provide consistency in a multi-user environment
    P2P Network Architectures
    • Centralized (Napster)
    • Decentralized
    – Unstructured (Gnutella)
    – Structured (Chord)
    • Hierarchical (MBone)
    • Hybrid (EDonkey)
    P2P Computing
    • File sharing (e.g., Gnutella, Freenet, Limewire, KaZaA)
    • Collaboration (e.g., Magi, Groove, Jabber)
    • Distributed computing (e.g., SETI@home, the Search for Extraterrestrial Intelligence)
    [Figure: P2P application areas. Communication and collaboration: Groove, Skype. File sharing: Napster, Gnutella, Kazaa, Freenet, Overnet. Distributed computing: SETI@Home, folding@Home]
    [Figure: taxonomy. Computer Systems divide into Centralized Systems (mainframes, SMPs, workstations) and Distributed Systems (client-server, peer-to-peer)]
    P2P FILE SHARING APPLICATIONS
    [Figure: taxonomy. Centralized; Decentralized (structured, unstructured)]
  • 8.
    P2P Applications
    • File sharing (music, movies, …)
    – utilise the idle disk space for storage and the existing network bandwidth for search and download.
    – The cost of operation is very low
    • the majority of peers collect only objects that they are interested in anyway.
    – e.g. Napster, KaZaA and Gnutella
    File Sharing Services
    • Publish – insert a new file into the network
    • Lookup – given a file name X, find the host that stores the file
    • Retrieval – get a copy of the file
    • Join – join the network
    • Leave – leave the network
    Centralized P2P
    • Utilize a central directory for object location
    • For file-sharing P2P, the location is queried from central servers and the file is then downloaded directly from peers
    • Benefits
    – Simplicity
    – Limited bandwidth usage
    • Drawbacks
    – Unreliable (single point of failure), performance bottleneck, and scalability limits
    – Vulnerable to DoS attacks
    – Copyright infringement
    [Figure: peers upload their indexes to the centralized server; 1. query, 2. response, 3. transfer]
    Napster: Example
    [Figure: peers m1 to m6 hold resources A to F; queries for E ("E?") are resolved and E is transferred from m5]
    Unstructured P2P
    [Figure: left, queries are flooded to connected peers (neighbors); right, queries are flooded between supernodes: 1. query from peer node to supernode, 2. query between supernodes, then search and transfer]
    File Sharing: Gnutella
    • Gnutella is a file sharing protocol
    • Gnutella was originally designed by Nullsoft, a subsidiary of America Online.
    • Its architecture is completely decentralised and distributed
    • When a client wishes to connect to the network, it runs through a list of nodes that are most likely to be up, or takes a list from a website, and then connects to however many nodes it wants
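The publish, lookup and leave services of a centralized (Napster-style) directory can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `CentralDirectory` and its methods are hypothetical, and the index lives in memory on a single server, which is exactly the single point of failure listed under the drawbacks.

```python
from collections import defaultdict


class CentralDirectory:
    """Toy central index: maps a file name to the peers that published it."""

    def __init__(self):
        self._index = defaultdict(set)

    def publish(self, peer, filename):
        # A peer announces that it stores `filename`.
        self._index[filename].add(peer)

    def leave(self, peer):
        # Remove the departing peer from every entry.
        for holders in self._index.values():
            holders.discard(peer)

    def lookup(self, filename):
        # Return the peers to contact; the transfer itself is peer-to-peer.
        return sorted(self._index.get(filename, ()))
```

Only the location query goes through the server; retrieval would then happen directly between the requesting peer and one of the returned peers.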
  • 9.
    Gnutella Search Mechanism
    • Assume: m1's neighbors are m2 and m3; m3's neighbors are m4 and m5; …
    • A, B, C, D, E, F are resources
    • TTL
    [Figure: the query "E?" is flooded from m1 through its neighbors until it reaches the peer holding E]
    • Advantages
    – Fast lookup
    – Low join and leave overhead
    – Popular files are replicated many times, so a lookup with a small TTL will usually find the file
    • Can choose to retrieve from a number of sources
    • Disadvantages
    – Not 100% success rate, since TTL is limited
    – Very high communication overhead
    – Uneven load distribution
    Peer-to-Peer File Sharing is all about the trading of copyrighted music and videos without paying anything to the authors
    KaZaA
    • Native Windows application
    • 3 million users online sharing 4 petabytes of data
    [Figure: KaZaA client window showing a query, the music category and a banner ad]
    Kazaa
    • Sharman Networks
    • Kazaa is a file sharing program that allows you to download audio, video, images, documents and software files.
    Searching
    Search in Unstructured P2P
    Two general types of search in unstructured P2P:
    • Blind: try to propagate the query to a sufficient number of nodes (example: Gnutella)
    • Informed: utilize information about document locations
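The TTL-limited flooding described for Gnutella can be sketched as a breadth-first traversal that stops forwarding once the TTL is used up. The neighbor lists follow the assumption on the slide (m1 knows m2 and m3; m3 knows m4 and m5); placing resource E on m5 and the function name `flood_search` are assumptions made for this illustration.

```python
from collections import deque

# Neighbor lists from the slide's assumption; E on m5 is an assumed placement.
NEIGHBORS = {"m1": ["m2", "m3"], "m3": ["m4", "m5"]}
HOLDINGS = {"m5": {"E"}}


def flood_search(neighbors, holdings, start, resource, ttl):
    """Return the peers holding `resource` within `ttl` hops of `start`."""
    hits = set()
    seen = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        node, remaining = frontier.popleft()
        if resource in holdings.get(node, ()):
            hits.add(node)
        if remaining == 0:
            continue  # TTL exhausted: do not forward any further
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:  # each peer handles a query only once
                seen.add(nxt)
                frontier.append((nxt, remaining - 1))
    return hits
```

With these inputs a query from m1 with TTL 2 reaches m5, while TTL 1 stops at m2 and m3, which illustrates why the success rate is not 100% when the TTL is limited.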
  • 10.
    Blind Search Methods
    • BFS and Random Walk
    • In unstructured networks, flooding would exhaust the bandwidth of the network.
    [Figure: BFS; random walks]
    Informed search
    • Informed: utilize information about document locations.
    • APS
    Adaptive Probabilistic Search
    • Each node keeps a local index consisting of one entry for each object it has requested per neighbor.
    • Index values represent the probability of finding that object through that neighbor
    • Searching is based on the simultaneous deployment of k walkers and probabilistic forwarding.
    • If a hit occurs, the walker terminates successfully.
    • On a miss, the query is forwarded to one of the node's neighbors.
    • Example (indices at node A)
    – A chooses B with Pr = 0.3
    – A chooses C with Pr = 0.5
    – A chooses D with Pr = 0.2
    APS – an example
    • Node J holds the requested object
    • Nodes deploy 2 walkers
    • Initially all index values are 20
    • TTL = 3
    Collaborative Community
    • Rapidly changing work environment
    – Out-sourcing, in-sourcing, home-sourcing
    – Tight integration and team work with customers, partners, vendors
    • P2P allows management of documents at the level of closed working groups.
    • The collaboration software is designed to improve the productivity of individuals with common goals or interests.
    • Groove is a collaborative P2P system (http://www.groove.net)
    – Part of the Microsoft Office system
    – Document sharing and collaboration
    • vital for a business.
    – Office Groove 2007 is a collaboration software program
    • helps teams work together dynamically and effectively, even if team members work for different organizations, or work remotely.
    [Figure: Microsoft Office Groove 2007, "Work Together: Anyone, Anytime, Anyplace"]
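The probabilistic forwarding step of Adaptive Probabilistic Search can be sketched as a weighted random choice over a node's index entries. The index values 30, 50 and 20 below are assumed so that they reproduce the slide's example probabilities (0.3, 0.5, 0.2) at node A; the function names are hypothetical and the index-update rule of APS is not shown.

```python
import random


def forwarding_probabilities(index):
    """Normalize a node's index values for one object into probabilities."""
    total = sum(index.values())
    return {neighbor: value / total for neighbor, value in index.items()}


def choose_neighbor(index, rng=random):
    """Pick the neighbor that receives the walker, weighted by index value."""
    neighbors = list(index)
    weights = [index[n] for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


# Assumed index at node A for one requested object.
index_at_a = {"B": 30, "C": 50, "D": 20}
next_hop = choose_neighbor(index_at_a, random.Random(0))
```

When all index values are equal (for example the initial value 20), every neighbor is equally likely, so the first walkers behave like plain random walks.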
  • 11.
    Distributed Computing: SETI@home
    • Search for Extraterrestrial Intelligence: whether we are alone in the universe or whether there is intelligent life somewhere else in the Universe.
    • Over two million computers crunching away and downloading data gathered from the Arecibo radio telescope in Puerto Rico, USA
    • The SETI@Home project is widely regarded as the fastest computer in the world
    • Sharing of resources such as computation power, network bandwidth and storage
    • Achieves computing power more cheaply than a supercomputer can provide.
    • Developed by the Space Sciences Laboratory at the University of California, Berkeley, in the United States. http://setiathome.ssl.berkeley.edu
    • Launched in 1999
    How SETI@home works?
    • Collect data source
    – Use the telescope at Arecibo to collect source data from outer space.
    – SETI@home uses a data recorder to record the source data on removable tape.
    • Distribution of data source
    – SETI@home divides the data into fixed-size work units.
    – SETI@home distributes these data via the Internet from the servers to a client program.
    – The client program computes a result, returns it to the server, and gets another work unit.
    How SETI@home works? …
    • Scientific experiment that uses Internet-connected computers
    • Distributes a screen saver–based application to users
    • Applies signal analysis algorithms to different data sets to process radio-telescope data.
    • Has more than 3 million users
    [Figure: radio-telescope data reaches the main server; 2. SETI client (screen saver) starts; 3. SETI client gets data from server and runs; 4. client sends results back to server]
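The work-unit cycle described above (divide the data, hand a unit to a client, collect the result, hand out the next unit) can be sketched as follows. This is a toy model, not the SETI@home client: the function names are hypothetical, the data is an in-memory list, and `analyze` stands in for the signal analysis.

```python
def split_into_work_units(samples, unit_size):
    """Divide recorded samples into work units of `unit_size` items.

    The last unit may be shorter if the data does not divide evenly.
    """
    return [samples[i:i + unit_size] for i in range(0, len(samples), unit_size)]


def run_client(work_units, analyze):
    """Fetch each unit in turn, compute a result, and report it back."""
    results = []
    for unit_id, unit in enumerate(work_units):
        results.append((unit_id, analyze(unit)))
    return results


units = split_into_work_units(list(range(10)), 4)
reports = run_client(units, max)  # `max` is a placeholder analysis function
```

Because each unit is analyzed independently, many clients could run the same loop on different units at the same time, which is what makes the decomposition suitable for idle desktop machines.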
  • 12.
    Skype
    • "…a free program that uses the latest P2P…technology to bring affordable and high quality voice communications to people all over the world…"
    • Skype offers voice, video, chat and data transfer services over IP
    • The first stable version of Skype was released in July 2004; since then the number of users has kept growing.
    • Skype now claims more than 20 million accounts and between 4 and 6 million simultaneously connected users.
    Skype Software features
    • VoIP from computer to computer
    – The most used feature.
    • VoIP from computer to regular phone (Skype Out)
    – By registering on Skype's website it is possible to buy credit and then call all over the world at very interesting rates compared to the rates applied by phone companies.
    • Video conferencing
    – Introduced in Skype 2.0 in 2006.
    • Instant Messaging
    – This feature is comparable to many other instant messaging clients like MSN Messenger, Yahoo! Messenger, Google Talk, etc.
    – The main difference is that Skype does not tell the user whether the person he is chatting with is typing or not. This is due to the P2P design of the Skype network.
    • File Transfer
    – The Skype network design has a big influence on the quality of file transfers. It can make them very fast (1 Mbps) or very slow (3 kbps).
    Internet Telephony - Skype
    • The participants form a self-organizing P2P overlay network to locate and communicate with other participants.
    • The bandwidth is shared, and the sound or video in real time is shared as a resource
    • Skype has a similar architecture to its predecessor KaZaA
    • There are three types of nodes in the Skype network:
    – Ordinary peers
    – Super-nodes
    – Central login server
    • Communications are encrypted (RSA)
    Super nodes
    • Super nodes (SN) are Skype clients run by users that have a "good" Internet connection and a "good" computer.
    • Having a good Internet connection means having a public IP address, without firewall restrictions.
    • A good computer is a machine that can forward other users' communications and handle many connections.
    • SNs act as relays in the network
    – Hence, they need better connectivity and better performance.
    • SNs are used to connect Skype clients (SC) together.
    Skype – login
    • Skype clients directly connect to login servers, whose IP addresses are hard coded within the software.
    – In this connection the login name and the version are sent in clear text format.
    • The login server stores all user names and passwords and ensures that names are unique across the Skype name space
    • Connection to a bootstrap node
    – When the SC (Skype Client) is installed for the first time, it comes with a list of SNs to connect to.
    – First, the Skype client tries to connect to 5 SNs by sending a UDP packet to the IP addresses of super nodes randomly chosen from the host cache.
    – When the client finds a super node to connect to, it refreshes its list of active and available super nodes in the host cache.
    – The SC connects to a SN
  • 13.
    Skype - user search
    • Similar to KaZaA (searching for the callee)
    • The client sends a user name to its SN and receives a few IP addresses and port numbers in answer
    • Subsequently the client contacts these nodes
    • If it cannot find the user, it sends the request to its SN once again and receives another few IP addresses and port numbers
    • The process continues until the user is found
    Skype - call establishment
    • Routing in the Skype overlay network is done by the SN.
    • When a SC tries to establish a call, it first asks its SN (if it is not a SN itself) where the callee is and tries to connect directly to it.
    – If the SC is restricted because of a firewall, then it will connect to the callee using a SN as a relay.
    – If both the caller and the callee have public IP addresses, the caller sends signaling information over TCP to the callee
    [Figure: traffic volume by content type (Germany, BitTorrent)]
    P2P VIDEO STREAMING
    • Streaming video is content sent in compressed form over the Internet and displayed by the viewer in real time.
    • With streaming video or streaming media, a Web user does not have to wait to download a file to play it: the media is sent in a continuous stream of data and is played as it arrives.
    • The user needs a player, which is a special program that uncompresses the data and sends video data to the display and audio data to the speakers.
    • A player can be either an integral part of a browser or downloaded from the software maker's Web site.
    • P2P streaming – P2P TV
    • PPLive, PPStream, Joost (by Skype founders), …
    PPLIVE
    What is PPLive?
    – An online video broadcasting and advertising network
    • Provides an online viewing experience comparable to that of traditional TV broadcasting
    • 75 million global installed base and 20 million monthly active users
    • 600+ channels on PPLive with content ranging from news, music, sports, movies, games, live video and other interactive services to a global audience
    – An efficient P2P technique platform and test bench
    History of PPLive:
    • Bill's story
    – Inventor of PPLive core technology
    – Dropped out of post-graduate program to start PPLive
  • 14.
    Industry Trends
    • PPLive is well positioned to exploit the next explosive growth
    [Figure: progression from basic to advanced applications along a time axis marked 2001, 2003, 2004 and 2005: file sharing (Napster), downloading (BitTorrent), VoIP (Skype), video streaming (PPLive)]
    PPLive
    • Streaming Media Server (channel management server)
    – Retrieve the list of channels via HTTP
    • Membership Server
    – Retrieve a small list of member nodes of interest via UDP
    Single-tree Streaming
    • A common approach to P2P streaming is to organize participating peers into a single tree-structured overlay
    – The content is pushed from the source towards all peers.
    – This way of organizing peers is called single-tree streaming.
    • In these systems, peers are hierarchically organized in a tree structure where the root is the stream source.
    • The content is spread as a continuous flow of information from the source down the tree.
    [Figure: a snapshot of a tree-based overlay with 231 nodes]
    [Figure: streaming tree reconstruction after a peer departure]
    Multi-tree Streaming
    • Since all peers are involved in the data distribution, the load is spread among all nodes.
  • 15.
    Bit Torrent
    • Created by Bram Cohen in 2001
    What is BitTorrent?
    • A peer-to-peer file transfer protocol
    • Extremely popular today
    • "Pull-based" "swarming" approach
    – Each file is split into smaller pieces
    – Nodes request desired pieces from neighbors
    – As opposed to parents pushing data that they receive
    – Pieces are not downloaded in sequential order
    • Encourages contribution by all nodes
    Overall Architecture
    [Figure, shown in several animation steps: a web server, a tracker, and three peers A, B and C; one peer is a seed, the others are leeches, and the downloader is marked "US"]
  • 16.
    Overall Architecture
    [Figure, shown in several animation steps: a web server, a tracker, and three peers A, B and C; one peer is a seed, the others are leeches, and the downloader is marked "US"]
    BitTorrent Lingo
    • Seeder = a peer that provides the complete file.
    • Initial seeder = a peer that provides the initial copy.
    • Leecher = one who is downloading
    [Figure: an initial seeder, a seeder and leechers]
    BitTorrent Basics
    • Files are broken into pieces.
    – Users each download different pieces from the original uploader (seed).
    – Users exchange the pieces with their peers to obtain the ones they are missing.
    • This process is organized by a centralized server called the Tracker.
    Critical Elements
    • A web server
    – stores and serves the .torrent file.
    – For example:
    • http://bt.btchina.net
    • http://bt.ydy.com/
    [Figure: a web server hosting "The Lord of Ring.torrent" and "Troy.torrent"]
  • 17.
    BitTorrent Swarm
    • Swarm
    – Set of peers all downloading the same file
    – Organized as a random mesh
    • Each node knows the list of pieces downloaded by its neighbors
    • A node requests pieces it does not own from its neighbors
    Critical Elements
    • The .torrent file (for example, Matrix.torrent)
    – Static 'metainfo' file containing the necessary information:
    • URL of tracker
    • Piece length, usually 256 KB
    • SHA-1 hashes of each piece in the file
    • IP address of the tracker
    • Swarm
    – The group of machines that are collectively connected for a particular file.
    • For example, if you start a BitTorrent client and it tells you that you're connected to 10 peers and 3 seeds, then the swarm consists of you and those 13 other people.
    • A BitTorrent tracker
    – The tracker maintains information about all BitTorrent clients utilizing each torrent.
    – The tracker identifies the network location of each client either uploading or downloading the P2P file associated with a torrent.
    – It also tracks which fragment(s) of that file each client possesses, to assist in efficient data sharing between clients.
    • i.e. the tracker keeps track of all peers downloading the file
    – For example:
    • http://bt.cnxp.com:8080/announce
    • http://btfans.3322.org:6969/announce
    • An end user (peer)
    – Users who want to use BitTorrent must install the corresponding software or a plug-in for web browsers.
    – Downloader (leecher): a peer that has only a part (or none) of the file.
    – Seeder: a peer that has the complete file and chooses to stay in the system to allow other peers to download
    – BitTorrent clients connect to a tracker when attempting to work with torrent files.
    • The tracker notifies the client of the P2P file location (that is normally on a different, remote server).
    How a node enters a swarm for file "popeye.mp4"
    • File popeye.mp4.torrent is hosted at a (well-known) webserver
    • The .torrent has the address of the tracker for the file
    • The tracker, which runs on a webserver as well, keeps track of all peers downloading the file
    [Figure: step 1, the peer contacts the webserver www.bittorrent.com]
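The per-piece SHA-1 hashes listed for the .torrent metainfo can be computed with Python's standard `hashlib`. This is a minimal sketch under stated assumptions: the function name `piece_hashes` is hypothetical, the file content is held in memory as bytes, and the surrounding metainfo encoding is not shown. The default piece length follows the 256 KB figure quoted on the slide.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int = 256 * 1024):
    """Split `data` into pieces and return the 20-byte SHA-1 digest of each.

    The final piece may be shorter than `piece_length`.
    """
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


# Small piece length used here only to keep the example readable.
hashes = piece_hashes(b"a" * 600, piece_length=256)
```

A downloader that holds these digests can check every piece it receives independently of the peer that sent it.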
  • 18.
    How a node enters a swarm for file "popeye.mp4"
    • File popeye.mp4.torrent is hosted at a (well-known) webserver
    • The .torrent has the address of the tracker for the file
    • The tracker, which runs on a webserver as well, keeps track of all peers downloading the file
    [Figure: the webserver www.bittorrent.com, the peer, the tracker and the swarm; step 2 involves the peer and the tracker, step 3 the peer and the swarm]
    BT: publishing a file
    [Figure: Harry Potter.torrent is published through a web server and a tracker; user Bob, seeder John, downloaders Joe, Fan and Bin]
    Three elements necessary to sharing a file with BitTorrent
    • The tracker: coordinates connections among the peers.
    – The tracker doesn't know anything about the actual contents of a file
    – Generally, it's considered good manners to continue seeding a file after you have finished downloading, to help out others.
    • The web server: stores and serves the .torrent file.
    • At least one seeder
    – Contains any of the file's actual contents.
    – The seeder is almost always an end-user's desktop machine (peer), rather than a dedicated server machine.
    – Seeding is monitored by the tracker
    – Seed your file for a long time to prevent peers from being left with incomplete files.
    • When you finish a download in BitTorrent, and you are only uploading, you're seeding!
    File sharing
    • Large files are broken into pieces of size between 64 KB and 1 MB
    [Figure: a file divided into pieces 1 to 8]
    A trivial example
    [Figure: seeder A holds pieces {1,2,3,4,5,6,7,8,9,10}; downloaders B and C start with {} and build up piece sets such as {1,2,3}, {1,2,3,4}, {1,2,3,5} and {1,2,3,4,5}]
  • 19.
    Types of P2P Attacks
    • Poisoning: a client can provide content that doesn't match the description.
    – A client A can broadcast a message saying it needs file 'X'. A malicious client can send a message back to A saying it has file X, then send it file Y.
    • Denial of Service attacks that decrease or cease total capable network activity.
    • Defection attacks, which allow a client to participate in the network with a very low upload-to-download ratio.
    Types of P2P Attacks….
    • Virus attacks, where a malicious client can add viruses into files shared on the network.
    • Malware attacks, where the P2P software contains spyware.
    • Filtering attacks, where network operators may attempt to prevent P2P network data from being carried.
    P2P Security
    • P2P file sharing networks are constantly under attack.
    • P2P is potentially more vulnerable than client-server.
    – Decentralized
    – More difficult to manage and control
    • Need to understand the security issues for architecting future P2P apps
    P2P Technical Challenges
    • Routing protocols
    • Network topologies
    • Peer discovery
    • Communication/coordination protocols
    • Quality of service
    • Security
    P2P SECURITY
    Security is the condition of being protected against danger or loss.
    Attacks On & From
    • Attacks on P2P systems:
    • Attacks from P2P systems:
  • 20.
    Attacks on P2P sharing
    Two types:
    • Pollution: file corruption (targets the file content)
    • Index poisoning (targets the file index)
    File Pollution
    [Figure, shown in several steps: a pollution company turns original content into polluted content; pollution servers inject it into the file sharing network; unsuspecting users such as Alice and Bob spread the pollution ("Yuck")]
    INDEX POISONING
    • The aim of the attacker is to make several peers believe that some popular file is present at the victim.
    • The attacker sends a location publish message to every crawled peer.
    • In this message, the attacker includes the victim's IP address and port number.
    • The attacker puts the file hash of a popular file in the message.
    • Peer B adds this file hash to its index along with the location of the victim.
    • When a peer C searches for that file, it may be told by some poisoned peer that the victim has the file.
  • 21.
    Index Poisoning
    [Figure: peers 23.123.78.6, 123.12.7.98 and 234.8.89.20 in a file sharing network share the index below]
    title | location
    bigparty | 123.12.7.98
    smallfun | 23.123.78.6
    heyhey | 234.8.89.20
    Index Poisoning
    [Figure: the same index after poisoning, with an added entry that points at the victim 111.22.22.22]
    title | location
    bigparty | 123.12.7.98
    smallfun | 23.123.78.6
    heyhey | 234.8.89.20
    bighit | 111.22.22.22
    ROUTING TABLE POISONING
    • The aim of the attacker is to make the peers add the victim as their neighbor
    • The attacker sends node announcement messages to every crawled peer.
    • The attacker includes the victim's IP address and port number in these messages
    • The peers add the victim as their neighbor
    • Query messages are forwarded to the victim
    Free Riding
    • Peers share little or no data in P2P file-sharing systems
    • Measurement
    – Nearly 70% of Gnutella users share no files
    – Nearly 50% of all responses are returned by the top 1% of sharing hosts
    • Incentive mechanisms to encourage user cooperation
    P2P Worms
    • Topological scan worms
    • Passive worms
    • A computer worm is a self-replicating malware computer program.
    • It uses a computer network to send copies of itself to other nodes
    • It may do so without any user intervention.
    TOPOLOGICAL WORM ATTACK
  • 22.
    TOPOLOGICAL WORM ATTACK
    PASSIVE P2P WORMS
    • Vulnerability in the protocol
    • Wait for the vulnerable targets to contact them
    • Case 1
    – The worm can create infected copies of itself with attractive filenames and place them in the shared folder of the P2P client, or it will replace the files present in the shared folder with itself
    – e.g. VBS.Gnutella, Benjamin worm, etc.
    • Case 2
    – Answers positively to a proportion of search queries by changing the name of the corrupted file to match the search query
    – e.g. Gnuman
    P2P-Worm.Win32.Benjamin.a
    • P2P-Worm.Win32.Benjamin.a (Kaspersky Lab) is also known as: Worm.P2P.Benjamin.a (Kaspersky Lab), W32/Benjamin.worm (McAfee), W32.Benjamin.Worm (Symantec), Win32.HLLW.Benjamin (Doctor Web)
    • This worm uses the Kazaa file exchange P2P network to spread itself.
    • Benjamin is written in Borland Delphi and is approximately 216 KB in size; it is compressed by the AsPack utility.
    Effects
    • Eating up free disk space
    • Benjamin opens a Web page, called benjamin.xww.de, to display banner ads.
    – One morning the Benjamin.xww.de Web site had a message saying: "Domain closed due to massive abuse."
    How vulnerable is BitTorrent?
    Pollution Attack
    • 1. The peers receive the peer list from the tracker.
  • 23.
    Pollution Attack
    • 2. One peer contacts the attacker for a chunk of the file.
    • 3. The attacker sends back a false chunk.
    – This false chunk will fail its hash and will be discarded.
    • 4. The attacker requests all chunks from the swarm and wastes their upload bandwidth.
    DDOS Attack
    • DDOS = Distributed denial of service
    • Based on the fact that the BitTorrent tracker has no mechanism for validating peers.
    • Uses modified client software
    • 1. The attacker downloads a large number of torrent files from a web server.
    • 2. The attacker parses the torrent files with a modified BitTorrent client and spoofs his IP address and port number with the victim's as he announces that he is joining the swarm.
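The statement that a false chunk "will fail its hash and will be discarded" corresponds to comparing the SHA-1 digest of each received chunk with the digest published in the .torrent file. A minimal sketch using Python's standard `hashlib`; the function names and the in-memory lists are assumptions made for this illustration.

```python
import hashlib


def accept_piece(piece: bytes, expected_sha1: bytes) -> bool:
    """Keep a received piece only if its SHA-1 matches the published digest."""
    return hashlib.sha1(piece).digest() == expected_sha1


def filter_received(received, expected_digests):
    """Return the indices of the pieces that passed verification."""
    return [
        i for i, piece in enumerate(received)
        if accept_piece(piece, expected_digests[i])
    ]


genuine = [b"piece-0", b"piece-1"]
digests = [hashlib.sha1(p).digest() for p in genuine]
# The second piece is replaced by a false chunk from the attacker.
kept = filter_received([b"piece-0", b"bogus"], digests)
```

Verification protects the file content, but the bandwidth spent on fetching the false chunk is still lost, which is why pollution remains an attack on the swarm's resources.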
  • 24.
    DDOS Attack
    • 3. As the tracker receives requests for a list of participating peers from other clients, it sends the victim's IP address and port number.
    • 4. The peers then attempt to connect to the victim to try to download a chunk of the file.
    Attack illustration
    [Figure: the attacker collects .torrent files from a discussion forum and tells the tracker "Victim has the files!"; clients ask the tracker "Who has the files?" and then connect to the victim]
    Current Solutions: Pollution Attacks
    • Blacklisting
    – Achieved using software such as Peer Guardian or moBlock.
    – Blocks connections from blacklisted IPs, which are downloaded from an online database.
    Solutions – TRUST and REPUTATION
    • Most of the solutions proposed to solve the problem of attacks are based on building trust (and/or reputation) between the peers
    • Some of the popular approaches are:
    – DCRS (BitTorrent)
    – EigenTrust
    – XRep
    • These approaches do slow down the attack
    What is Trust? What is reputation?
    • Trust: a peer's belief in another peer's capabilities, honesty and reliability based on its own experiences.
    • Reputation: a peer's belief in another peer's capabilities, honesty and reliability based on recommendations received from other peers.
    – Reputation can be centralized, computed by a third party, or it can be decentralized, computed independently by each peer after asking other peers for recommendations.
  • 25.
    What is Trust?……..
    • Both trust and reputation are used to evaluate a peer's trustworthiness.
    • Trust and reputation increase or decrease with further experience.
    • Trust and reputation both depend on some context.
    • For example:
    – Mike trusts John as his doctor, but he doesn't trust John as a mechanic who can fix his car.
    • In the context of seeing a doctor, John is trustworthy
    • In the context of fixing a car, John is untrustworthy.
    What is Trust Management?
    • "Trust Management" was first coined by Blaze et al., 1996
    – a coherent framework for the study of security policies, security credentials and trust relationships.
    Reputation Management
    • Need for trust mechanisms
    – To assess the trustworthiness of peers and the content
    • Malicious peers generate an unlimited number of inauthentic files
    – To deter malicious behavior
    • Reputation is an assumption that past behavior is indicative of future behavior
    • Use of reputation to build trust
    An Example Trust Management System (BitTorrent)
    • Debit-Credit Reputation System (DCRS)
    • Each client calculates a local trust score for its peers
    – Based on valid pieces uploaded/downloaded
    • The tracker combines these individual scores to make a global score
    DCRS… (cont'd): Local Trust Score Computation
    • $F_{ij} = U_{ij} - D_{ij}$, where
    – $U_{ij}$ is the number of chunks that i uploaded to j
    – $D_{ij}$ is the number of chunks that i downloaded from j
    • Using $F_{ij}$, the local trust score $LT_{ij}$ is computed as
    $$LT_{ij} = \begin{cases} -1 & \text{if a bogus chunk is uploaded by peer } j \\ 0 & \text{if } F_{ij} > t \\ 1 & \text{if } F_{ij} \le t \end{cases}$$
    where t is the fairness threshold
    DCRS… (cont'd): Global Trust Score Computation
    • A global trust score represents the rest of the swarm's opinion of a peer.
    • At regular intervals the tracker receives the local trust scores of the peers in the swarm.
    • The tracker chooses k random local trust scores for peer j, where k is less than the number of peers in the swarm.
    • The tracker uses these k local trust scores for peer j and sets their average as the global trust score for j
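The DCRS local and global trust scores can be written down directly from the definitions on the slides: F_ij = U_ij - D_ij, the three-way local score with fairness threshold t, and a global score that averages k randomly chosen local scores. The sketch below only illustrates those rules; the function names, the `bogus` flag and the use of `random.sample` for the tracker's random choice are assumptions.

```python
import random
from statistics import mean


def local_trust(uploaded, downloaded, t, bogus=False):
    """Local trust score LT_ij that peer i assigns to peer j.

    uploaded   -- U_ij, chunks that i uploaded to j
    downloaded -- D_ij, chunks that i downloaded from j
    t          -- fairness threshold
    bogus      -- True if j uploaded a bogus chunk to i
    """
    if bogus:
        return -1
    f = uploaded - downloaded  # F_ij
    return 0 if f > t else 1


def global_trust(local_scores, k, rng=random):
    """Tracker side: average k randomly chosen local scores for one peer."""
    return mean(rng.sample(local_scores, k))
```

For example, with t = 1 a peer that has received 5 chunks from i but returned only 2 gets F_ij = 3 and a local score of 0, while any peer caught uploading a bogus chunk gets -1 regardless of its balance.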
  • 26.
    DCRS… (cont'd)
    • Global trust managed by the tracker prevents clients from being dishonest.
    • Solves the issue of pollution attacks by ignoring untrustworthy peers
    – Trust systems are more flexible than blacklisting because peers can earn back their trust through good behavior.
    • Prevents DDOS attacks because the victim will earn a low trust score and be ignored.
    Other Emerging P2P Applications
    Distributed Data Mining in P2P Networks
    • Data mining is the extraction of hidden predictive information from large databases
    • Most off-the-shelf data mining systems are designed to work as a monolithic centralized application.
    • Distributed data mining (DDM) deals with the problem of data analysis in environments with distributed data, computing nodes, and users.
    • P2P networks are well-suited to distributed data mining (DDM)
    • A primary goal of P2P data mining is to achieve the same (or close to the same) data mining result as a centralized approach, without moving any data from its original location.
    Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, Hillol Kargupta, "Distributed Data Mining in Peer-to-Peer Networks"
    • P2P systems already store a huge amount of widely varying data collected from different sources.
    • If this data, distributed over a large number of peers, can be integrated,
    – it represents a very valuable data repository that, upon mining, may give very exciting and useful results.
    • Peer-to-peer K-means Algorithm
    – K-means clustering partitions a collection of data tuples into K disjoint, exhaustive groups (clusters), where K is a user-specified parameter.
    Example: Topic-wise document clustering in a P2P document repository
    • Documents stored in different peers are clustered based on three subjects
    – movies
    – baseball
    – hurricane
    by exchanging information with other peers.
    • In P2P clustering,
    – some peers may not be present in the network all the time, and may join or leave the network while the clustering is in progress.
    Cloud computing
    • Cloud computing is a computing paradigm shift where computing is moved away from personal computers or an individual server to a "cloud" of computers.
    • Users of the cloud only need to be concerned with the computing service being asked for, as the underlying details of how it's achieved are hidden.
    • Done by pooling all computer resources together and managing them by software rather than by a human.
    • Prominent players include Google (AppEngine), Microsoft (Azure), Amazon (EC2), Yahoo-Apache (Hadoop) and Cisco-EMC (Acadia).
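The goal stated for P2P data mining, reaching the centralized result without moving the raw data, can be illustrated with one K-means update step in which each peer shares only per-cluster sums and counts. This is a simplified sketch and not the algorithm from the cited paper: the data is one-dimensional, the function names are hypothetical, and the summaries are merged in one place for brevity, whereas a real P2P variant would exchange them with neighbors only.

```python
def local_stats(points, centroids):
    """Per-cluster [sum, count] for one peer's own 1-D points."""
    stats = [[0.0, 0] for _ in centroids]
    for x in points:
        nearest = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        stats[nearest][0] += x
        stats[nearest][1] += 1
    return stats


def merge_and_update(all_stats, centroids):
    """Combine the peers' summaries into new centroids (one K-means step)."""
    updated = []
    for c, old in enumerate(centroids):
        total = sum(stats[c][0] for stats in all_stats)
        count = sum(stats[c][1] for stats in all_stats)
        updated.append(total / count if count else old)  # keep empty clusters
    return updated


# Three peers keep their points locally; only the summaries are shared.
peers = [[1.0, 2.0], [9.0], [3.0, 11.0]]
centroids = [0.0, 10.0]
new_centroids = merge_and_update(
    [local_stats(p, centroids) for p in peers], centroids
)
```

The merged sums and counts give the same centroids as running the update on all points in one place, so the raw data tuples never have to leave their peers.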
  • 27.
    Cloud Architecture
    [Diagram labels]
    • Users: Individuals, Corporations, Non-Commercial
    • Cloud Middleware: Storage Provisioning, OS Provisioning, Network Provisioning, Service (apps) Provisioning, SLA (monitor), Security, Billing, Payment
    • Resources: Services, Storage, Network, OS

    The Scale of the Cloud
    Google’s million-server warehouse (Oregon, USA). Each building is approximately the size of 2 football fields. Source: IEEE Spectrum, Feb. 2009

    The Scale of the Cloud
    Schematic of the million-server warehouse. The largest, most complex data-center in the world. Each container houses 2500 servers and has built-in networking, cooling and storage bundled together. [Figure labels: Cooling Towers; container] Source: IEEE Spectrum, Feb. 2009

    So What’s the Issue?
    • These super server-warehouses are expected to consume around 300 MegaWatts (MW) of electricity a month.
    • Existing large data-centres consume anywhere from 20 to 50 MW of electricity (enough to power 40,000 homes).
      – Hence, the energy consumption of the million-server warehouse raises serious concerns about its environmental sustainability.
    • The environmental impact of cloud computing has not received the desired attention of the research community and needs to be addressed.
      – It is estimated that Google’s data centers alone consume over 1.5% of the electricity produced world-wide.
      – The larger data-centers in the US are estimated to consume 25.6 GWh of electricity per year and produce 17006 tonnes of CO2 emissions.

    The Peer Enterprises Framework

    PE vs Cloud
    Parameter: Cloud Computing / Peer Enterprises
    1. Cost
      – Cloud Computing: Expensive to provision. Involves creation of internet-scale data-centres costing hundreds of millions of dollars.
      – Peer Enterprises: Uses already provisioned compute infrastructure. Hence, no new costs need to be incurred.
    2. Energy Consumption
      – Cloud Computing: Around 300 MegaWatts per month.
      – Peer Enterprises: Already provisioned and in-use. No additional energy consumption.
    3. Environmental Impact
      – Cloud Computing: Yes. Very High.
      – Peer Enterprises: Yes. But the PE concept does not place an additional load on the environment.
    4. Service Migration
      – Cloud Computing: Tedious, since vendor interoperability is undefined as yet.
      – Peer Enterprises: Easy. Organizations can enter into new contracts and terminate existing contracts.
    5. Degree of Decentralization
      – Cloud Computing: Fairly centralized control. All applications running in a single data-centre. Centralized elements for load-balancing, scalability etc.
      – Peer Enterprises: Based on the decentralized P2P concept. No centralized elements/control. Organizations are free to negotiate resource sharing contracts as required.
    6. Data Lock-In
      – Cloud Computing: Yes. Once all data resides with a single vendor, any connectivity faults can render the data irretrievable.
      – Peer Enterprises: Organizations can enter into multiple contracts to avoid data lock-in by creating requisite redundancy.
    7. Performance
      – Cloud Computing: Expected to be very performant, since dedicated compute infrastructure is involved.
      – Peer Enterprises: Less performant due to frequent node transience. Performance enhancements and optimizations need to be devised.
    8. Vendor Dependence
      – Cloud Computing: High. Compute infrastructure from only one provider can be used.
      – Peer Enterprises: None. As many service providers as needed can be used by entering into contracts.
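The energy figures quoted on these slides mix power (MW) and energy (GWh). The arithmetic below is a rough sanity check relating them, not data from the source. It assumes a constant load over a 30-day month, and the function names are illustrative.

```python
def monthly_energy_gwh(power_mw: float, days: int = 30) -> float:
    """Energy in GWh drawn by a constant load of `power_mw` megawatts over `days` days."""
    hours = 24 * days
    return power_mw * hours / 1000.0  # MWh -> GWh


def emission_factor_kg_per_kwh(tonnes_co2: float, energy_gwh: float) -> float:
    """CO2 emitted per kWh, given total tonnes of CO2 and total energy in GWh."""
    kilograms = tonnes_co2 * 1000.0
    kilowatt_hours = energy_gwh * 1_000_000.0
    return kilograms / kilowatt_hours


def average_kw_per_home(power_mw: float, homes: int) -> float:
    """Average power per home in kW if `power_mw` megawatts supplies `homes` homes."""
    return power_mw * 1000.0 / homes


# A constant 300 MW load over 30 days corresponds to 216 GWh.
warehouse_gwh = monthly_energy_gwh(300)

# 17006 tonnes of CO2 for 25.6 GWh is roughly 0.66 kg of CO2 per kWh.
factor = emission_factor_kg_per_kwh(17006, 25.6)

# 50 MW shared by 40,000 homes is 1.25 kW per home on average.
per_home_kw = average_kw_per_home(50, 40_000)
```

Under these assumptions a 300 MW warehouse would use far more energy in one month than the 25.6 GWh per year quoted for the larger US data-centers, which is the scale concern the slide raises.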
  • 28.
    WWW + Mobile Telephony = Mobile Access to Information
    [Chart: Mobile Telephone Users and Internet Users, 1993 to 2001; vertical axis 0 to 700]

    MANET: Mobile Ad hoc Networks
    • Ad hoc networks are wireless, self-organizing systems that provide functionality without infrastructure support.
    • Ad hoc means that there are no central servers.
      – Content is distributed to several nodes instead of one server
    • MANET: a collection of wireless mobile nodes dynamically forming a network without any existing infrastructure, in which the nodes’ relative positions dictate the (dynamically changing) communication links.

    Problems/challenges for ad hoc networks
    • Problems are due to
      – No central entity available for organization
      – Limited range of wireless communication
      – Mobility of participants
      – Battery-operated entities

    Mobile P2P?
    • Transferring data from one mobile phone to another
    • Mobile phone and network limit the possibilities of mobile P2P
      – Low efficiency (CPU and memory)
      – Low bandwidth
      – Power constraints due to battery operation
      – Billing
    Much more challenging as compared to traditional P2P

    Full mobile P2P in 2/2.5G
    • In 2/2.5G there are limitations that are impossible to overcome:
      – Operators do not expose a mobile phone’s IP address
      – Operators control data traffic
      – Network does not offer any way to sustain an active connection in all situations
      – Voice and data cannot be transferred simultaneously
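The MANET definition on this slide says that the nodes' relative positions dictate the communication links, and that those links change as nodes move. A minimal sketch of that idea follows. It assumes a simple disc model in which two nodes are linked whenever they are within a fixed radio range; the model, the function name and the coordinates are illustrative assumptions, not part of the slides.

```python
import math
from typing import Dict, Set, Tuple

Position = Tuple[float, float]


def current_links(positions: Dict[str, Position], radio_range: float) -> Set[Tuple[str, str]]:
    """Return the pairs of nodes that can currently hear each other directly."""
    names = sorted(positions)
    links = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Limited wireless range: a link exists only within `radio_range`.
            if math.dist(positions[a], positions[b]) <= radio_range:
                links.add((a, b))
    return links


# Node c starts out of range of everyone, then moves next to b.
before = current_links({"a": (0, 0), "b": (3, 4), "c": (10, 0)}, radio_range=5)
after = current_links({"a": (0, 0), "b": (3, 4), "c": (6, 4)}, radio_range=5)
```

Recomputing the link set after every movement shows why there is no fixed topology to rely on: once `c` moves, it can reach `a` only through `b`.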
  • 29.
    A solution to 2/2.5G P2P: MMS
    • MMS could be used as a way of sending data from one mobile node to another.
    However there are problems:
      – How to know who has the information you need?
      – MMS size is limited
      – MMS costs more than GPRS data

    A solution to 2/2.5G P2P: MMS
    • We have to have a server that keeps a record of the MSISDN (IMSI) number and the data that can be found from that number
    Mobile Station International Subscriber Directory Number (MSISDN) is a number used to identify a mobile phone number internationally.
    MSISDN = CC + NDC + SN
      – CC = Country Code
      – NDC = National Destination Code
      – SN = Subscriber Number
    Example (an IMSI, which has the analogous layout of country code, network code and subscriber number): 429 01 1234567890
      – 429: Nepal
      – 01: Nepal Telecom
      – 1234567890: subscriber number

    A better solution: computer aided P2P
    • All the major limitations could be overcome if the mobile phone were connected to a computer that has P2P software
    • We would only need software to communicate between the computer and the mobile phone:
      – Short distance: Infrared, Bluetooth etc.
      – Remotely: Over HTTP

    Computer aided P2P: short distance
    • Within short distance we would not have true mobile P2P
    • A better solution would be to control a fixed network peer remotely

    Computer aided mobile P2P: remotely
    • For example, over HTTP we could control the fixed network peer by using a program called mobile eMule
    • The downloader requests the data, and the person being downloaded from permits or denies the download.

    eMule
    • eMule is a free peer-to-peer file sharing application for Microsoft Windows.
    • The name "eMule" comes from an animal called "Mule" which is somewhat similar to a donkey
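The server described on the MMS slide keeps a record of each subscriber number and the data available from it. It can be sketched as a small lookup table. The class, method and field names below are hypothetical, and the sample number simply reuses the digits from the slide as a three-part, space-separated string.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class SubscriberNumber(NamedTuple):
    country: str     # first field of the number (country part)
    network: str     # second field (operator or destination part)
    subscriber: str  # third field (individual subscriber)


def split_number(text: str) -> SubscriberNumber:
    """Split a three-part, space-separated number such as '429 01 1234567890'."""
    parts = text.split()
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected three digit groups, got {text!r}")
    return SubscriberNumber(*parts)


class ContentDirectory:
    """Hypothetical server-side record: which subscriber number offers which files."""

    def __init__(self) -> None:
        self._files: Dict[str, Set[str]] = defaultdict(set)

    def register(self, number: str, filename: str) -> None:
        key = "".join(split_number(number))  # store the number without spaces
        self._files[key].add(filename)

    def who_has(self, filename: str) -> List[str]:
        """Answer the slide's question: who has the information you need?"""
        return sorted(n for n, files in self._files.items() if filename in files)


directory = ContentDirectory()
directory.register("429 01 1234567890", "photo.jpg")
holders = directory.who_has("photo.jpg")
```

A phone would query `who_has` first and then send its MMS request to one of the returned numbers. The size and cost limits of MMS listed on the slide still apply to the transfer itself.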
  • 30.
    eMule – how it works?
    • Each file that is shared using eMule is hashed as a hash list using the MD4 algorithm.
    • The MD4 hash, file size, filename, and search attributes are stored on eD2k servers
    • Users can search for filenames in the servers
    • Users are presented with the filenames and the unique identifier, consisting of the MD4 hash for the file and the file's size, that can be added to their downloads.
    • The client then asks the servers where the other clients using that hash are.
    • The servers return a set of IP addresses that indicate the locations of the clients that share the file.
    • eMule then asks the peers for the file.

    Computer aided mobile P2P: eMule
    [Diagram steps: 1. login, 2. search, 3. download to peer, 4. download to phone]
    • eMule is a working solution
    • eMule has a large user base, currently averaging 3 to 5 million

    3G
    • Deliver speeds up to 14.4 Mbit/s on the downlink and 5.8 Mbit/s on the uplink.
    • Consumers will be charged on the quantity of data they transmit, not on how much time they are connected to the network.
    • With 3G you are constantly online and basically pay for the information you receive.
    • Because third-generation packet-based networks allow users to be online all the time, the potential for new applications is huge.

    Threats to mobile P2P
    • In 3G true mobile P2P is possible due to high bandwidth, efficient mobile phones and simultaneous voice and data capability
      – But will the operators allow P2P software, since it would lead to a loss of revenue?
      – In the 3G network architecture, every data connection of a mobile terminal is routed through the operator’s network. This makes it possible for the operator to fully control the traffic of mobile terminals.
    • For example, a network operator has the power to allow or prevent terminal-to-terminal connections in its network.
    • P2P protocols demand direct connections between the peers because their key idea is that the peers communicate directly with each other without any central server.
      – Lack of terminal-to-terminal connections would make it impossible for true P2P to exist.
    • Data transfer fees are currently quite high, which reduces the willingness of users to share data in MP2P networks
    • Use of MP2P applications may reduce the operators’ opportunities to sell their own services.
    • Viruses, spyware etc.

    P2P Based Software Engineering
    • With rapid development of the network technologies, software development is becoming more and more complicated.
    • Traditional SE management methods based on the C/S structure have not coped well with large-scale software development.
    • Proposes an SE management method based on P2P
      – Overcomes the server bottlenecks that exist in C/S
      – Takes full advantage of computation resources
    Lina Zhao, Yin Zhang, Sanyuan Zhang, and Xiuzi Ye, “P2P-Based Software Engineering Management”
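The eMule lookup flow described on this slide (hash the file, publish hash and size to a server, search by filename, then ask the server which clients share that identifier) can be sketched as below. This is an illustration under stated assumptions, not eMule's implementation. SHA-256 from Python's `hashlib` stands in for MD4, which is not available in every Python build; the chunk size is a toy value; and `IndexServer` and its methods are hypothetical names.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

CHUNK_SIZE = 4  # toy chunk size for the demo; real clients use much larger chunks

FileId = Tuple[str, int]  # (hash of the hash list, file size in bytes)


def file_id(data: bytes, chunk_size: int = CHUNK_SIZE) -> FileId:
    """Hash each chunk, then hash the list of chunk hashes (SHA-256 stands in for MD4)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    digests = [hashlib.sha256(chunk).digest() for chunk in chunks]
    root = digests[0] if len(digests) == 1 else hashlib.sha256(b"".join(digests)).digest()
    return root.hex(), len(data)


class IndexServer:
    """Hypothetical index server: stores identifiers, filenames and sources, never file data."""

    def __init__(self) -> None:
        self.names: Dict[FileId, str] = {}
        self.sources: Dict[FileId, Set[str]] = defaultdict(set)

    def publish(self, peer: str, filename: str, data: bytes) -> FileId:
        fid = file_id(data)
        self.names[fid] = filename
        self.sources[fid].add(peer)
        return fid

    def search(self, term: str) -> List[Tuple[str, FileId]]:
        """Return (filename, identifier) pairs whose filename contains `term`."""
        return [(name, fid) for fid, name in self.names.items() if term.lower() in name.lower()]

    def find_sources(self, fid: FileId) -> List[str]:
        """Return the addresses of the clients that share the file with this identifier."""
        return sorted(self.sources.get(fid, ()))


server = IndexServer()
fid = server.publish("10.0.0.1:4662", "song.mp3", b"abcdefghij")
server.publish("10.0.0.2:4662", "song.mp3", b"abcdefghij")
hits = server.search("song")
peers = server.find_sources(fid)
```

The server only answers the search and source queries; the download itself then goes directly to the returned peers, which is the last step on the slide.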
  • 31.
    Future
    • Semantic P2P
    • Cloud Computing
    • Data Mining
    • P2P Based Software Engineering
    • Audio/Video Streaming
    • Security – autonomic computing
    • Collaborative learning
    • Mobile P2P
    • Emergency First Response