Effiziente Verarbeitung von grossen Datenmengen
A talk by Tristan Schneider from the Hauptseminar "Personalisierung mit großen Daten" (Personalization with Big Data).

Presentation Transcript

  • Effiziente Verarbeitung von grossen Datenmengen, Teil II. Tristan Schneider, January 9, 2014.
  • Contents: Introduction (Social Graph; Problems and Motivation), Approaches (TAO, Horton, Pregel, Trinity, Unicorn), Conclusion (Comparison; Future Work).
  • Social Graph: consists of nodes and edges; describes entities and their relations; used by Facebook, Google, Amazon, etc.; about 100+ million nodes and 10+ billion edges.
  • Problems and Motivation: the amount of data exceeds the capacity of a single machine; it is necessary to distribute data and computation; data access is managed by a framework; different requirements (latency, throughput).
  • TAO: developed by Facebook; read-optimized; fixed set of queries. Strength: low-latency access.
  • TAO: Data Model: data identified by 64-bit integers. Objects: (id) → (otype, (key → value)*). Associations: (id1, atype, id2) → (time, (key → value)*).
  • TAO: API: fixed set of queries: assoc_add, assoc_delete, assoc_change_type, assoc_get, assoc_count, assoc_range, assoc_time_range.
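
To make the data model and the assoc_* calls on the two slides above concrete, here is a minimal single-machine sketch in C++. The TaoStore class, the std::map-backed storage, and the simplified signatures are assumptions for illustration only; they are not Facebook's implementation, which shards this data across many cache and database servers.

```cpp
// Minimal in-memory sketch of a TAO-style object/association store.
// Types and signatures are simplified illustrations of the data model above.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Object {                       // (id) -> (otype, (key -> value)*)
    int64_t id;
    std::string otype;
    std::map<std::string, std::string> data;
};

struct Assoc {                        // (id1, atype, id2) -> (time, (key -> value)*)
    int64_t id1;
    std::string atype;
    int64_t id2;
    int64_t time;
    std::map<std::string, std::string> data;
};

class TaoStore {
public:
    void add_object(Object o) { objects_[o.id] = std::move(o); }

    // assoc_add(id1, atype, id2, time, data)
    void assoc_add(int64_t id1, const std::string& atype, int64_t id2,
                   int64_t time, std::map<std::string, std::string> data) {
        assocs_[{id1, atype}].push_back({id1, atype, id2, time, std::move(data)});
    }

    // assoc_get(id1, atype): all associations of one type leaving id1
    std::vector<Assoc> assoc_get(int64_t id1, const std::string& atype) const {
        auto it = assocs_.find({id1, atype});
        return it == assocs_.end() ? std::vector<Assoc>{} : it->second;
    }

    // assoc_count(id1, atype)
    size_t assoc_count(int64_t id1, const std::string& atype) const {
        return assoc_get(id1, atype).size();
    }

private:
    std::map<int64_t, Object> objects_;
    std::map<std::pair<int64_t, std::string>, std::vector<Assoc>> assocs_;
};

int main() {
    TaoStore store;
    store.add_object({1, "user", {{"name", "alice"}}});
    store.add_object({2, "page", {{"title", "Computer Science"}}});
    store.assoc_add(1, "likes", 2, 1389225600, {});
    std::cout << "alice likes " << store.assoc_count(1, "likes") << " page(s)\n";
}
```
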
  • TAO: Architecture: data is divided into shards (via hashing); each server handles one or more shards; an object and its associations live in the same shard; an object never changes its shard.
  • TAO: Architecture: servers are divided into leaders and followers; clients always communicate with followers; cache misses and writes are redirected to the leader; slave servers support master servers if necessary.
  • TAO: Architecture: Scheme (figure).
  • TAO: Fault Tolerance and Performance: efficiency and availability > consistency; a global mark for down servers; followers are interchangeable; slave databases are promoted to master if the master crashes.
  • TAO: Fault Tolerance and Performance. Figure: write access latencies (https://www.facebook.com/download/273893712748848/atc13-bronson.pdf).
  • Horton: query language; execution engine written in C#. Strength: interactive queries with low latency.
  • Horton: Data Model: similar to TAO; divided into partitions; additional data can be attached (e.g. key-value pairs); directed edges are stored at both source and target.
  • Horton: API: Horton query language; queries are initiated via a client (library).
  • Horton: Architecture: the Graph Client Library translates the query into a regular expression; the Graph Coordinator translates the regular expression into a finite state machine and finds the most efficient execution plan; the Graph Partitions execute the finite state machine and traverse the graph; the Graph Manager provides an interface for administering the graph.
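
A toy sketch of the "finite state machine over the graph" execution idea, assuming the query has already been reduced to a plain sequence of node types (for example Person-Photo-Person): each FSM state accepts one node type, and the machine advances by one state per hop. The Node struct and run_fsm are illustrative assumptions; they ignore edge predicates, distribution across partitions, and plan selection.

```cpp
// Toy FSM execution over a graph: the i-th state accepts nodes of the i-th
// type in the query; execution advances the state on every hop.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Node {
    std::string type;                  // e.g. "Person", "Photo"
    std::vector<int> out;              // ids of outgoing neighbours
};

// Return all node-id paths that match the type sequence, starting anywhere.
std::vector<std::vector<int>> run_fsm(const std::unordered_map<int, Node>& g,
                                      const std::vector<std::string>& states) {
    std::vector<std::vector<int>> frontier;
    for (const auto& [id, n] : g)
        if (n.type == states[0]) frontier.push_back({id});

    for (size_t s = 1; s < states.size(); ++s) {
        std::vector<std::vector<int>> next;
        for (const auto& path : frontier)
            for (int nb : g.at(path.back()).out)
                if (g.at(nb).type == states[s]) {
                    auto p = path;
                    p.push_back(nb);
                    next.push_back(std::move(p));
                }
        frontier = std::move(next);
    }
    return frontier;
}

int main() {
    std::unordered_map<int, Node> g = {
        {1, {"Person", {2}}},          // person 1 is tagged in photo 2
        {2, {"Photo",  {1, 3}}},
        {3, {"Person", {2}}},
    };
    // find Person-Photo-Person paths (e.g. people tagged in the same photo)
    for (const auto& path : run_fsm(g, {"Person", "Photo", "Person"})) {
        for (int id : path) std::cout << id << ' ';
        std::cout << '\n';
    }
}
```
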
  • Pregel: C++ based; computation consists of parallel iterations; communication via messaging. Strength: high throughput (for analysis).
  • Pregel: Data Model: the graph is divided into partitions; partition assignment is based on the node id (hash(id) mod n).
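
A minimal sketch of the stated assignment rule, with std::hash standing in for whatever hash function a real deployment would use:

```cpp
// Minimal sketch of the assignment rule above: vertex id -> hash(id) mod n.
#include <cstdint>
#include <functional>
#include <iostream>

int partition_of(int64_t vertex_id, int num_partitions) {
    return static_cast<int>(std::hash<int64_t>{}(vertex_id) % num_partitions);
}

int main() {
    const int n = 4;   // number of partitions (one or more per worker)
    for (int64_t id : {7LL, 42LL, 1000000007LL})
        std::cout << "vertex " << id << " -> partition " << partition_of(id, n) << '\n';
}
```
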
  • Pregel: API: implement a Vertex class (the task); define methods like Compute(...) and SendMessageTo(...).
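
The following sketch simulates the vertex-centric model on a single machine so the Compute/SendMessageTo pattern can be run end to end. The Worker driver, the message plumbing, and the maximum-value example are illustrative assumptions, not Google's implementation; in real Pregel the vertices are spread over many workers and messages cross the network.

```cpp
// Simplified single-machine simulation of the Pregel vertex model: users
// subclass Vertex and implement Compute(); messages sent in superstep S are
// delivered in superstep S+1; the run ends when every vertex has voted to
// halt and no messages are in flight.
#include <algorithm>
#include <iostream>
#include <map>
#include <memory>
#include <vector>

class Worker;

class Vertex {
public:
    Vertex(int id, int value, std::vector<int> out)
        : id(id), value(value), out(std::move(out)) {}
    virtual ~Vertex() = default;
    virtual void Compute(const std::vector<int>& messages) = 0;

    int id, value;
    std::vector<int> out;        // ids of outgoing neighbours
    bool halted = false;
    Worker* worker = nullptr;

protected:
    void SendMessageTo(int target, int msg);
    void VoteToHalt() { halted = true; }
};

class Worker {
public:
    void Add(std::unique_ptr<Vertex> v) {
        v->worker = this;
        int key = v->id;
        vertices[key] = std::move(v);
    }
    void Send(int target, int msg) { next_inbox[target].push_back(msg); }

    void Run() {
        while (true) {
            auto inbox = std::move(next_inbox);
            next_inbox.clear();
            bool any_active = false;
            for (auto& [id, v] : vertices) {
                if (inbox.count(id)) v->halted = false;   // a message wakes a halted vertex
                if (v->halted) continue;
                v->Compute(inbox.count(id) ? inbox[id] : std::vector<int>{});
                any_active = true;
            }
            if (!any_active && next_inbox.empty()) break;
        }
    }

    std::map<int, std::unique_ptr<Vertex>> vertices;
    std::map<int, std::vector<int>> next_inbox;
};

void Vertex::SendMessageTo(int target, int msg) { worker->Send(target, msg); }

// Example task: every vertex ends up with the maximum value in its component.
class MaxValueVertex : public Vertex {
public:
    using Vertex::Vertex;
    void Compute(const std::vector<int>& messages) override {
        int best = value;
        for (int m : messages) best = std::max(best, m);
        if (best > value || first) {          // propagate on change (or initially)
            value = best;
            first = false;
            for (int t : out) SendMessageTo(t, best);
        } else {
            VoteToHalt();
        }
    }
private:
    bool first = true;
};

int main() {
    Worker w;
    w.Add(std::make_unique<MaxValueVertex>(1, 3, std::vector<int>{2}));
    w.Add(std::make_unique<MaxValueVertex>(2, 6, std::vector<int>{1, 3}));
    w.Add(std::make_unique<MaxValueVertex>(3, 1, std::vector<int>{2}));
    w.Run();
    for (const auto& [id, v] : w.vertices)
        std::cout << "vertex " << id << ": " << v->value << '\n';   // all print 6
}
```
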
  • Pregel: Architecture: runs on a cluster management system; uses a distributed storage system (e.g. Bigtable).
  • Pregel: Basic Work Flow: 1. the task is copied to the worker machines; one worker is promoted to master. 2. the master assigns one or more partitions to each worker. 3. the master invokes supersteps. 4. the graph is saved after the computation.
  • Pregel: Fault Tolerance and Performance: workers save their progress at checkpoint supersteps; worker failures are detected via pings; partitions of failed workers are reassigned to available workers; state is reloaded from the most recent available checkpoint superstep; the process terminates if the master fails.
  • Pregel: Fault Tolerance and Performance. Figure: varying number of workers on a 1-billion-vertex binary tree (http://kowshik.github.io/JPregel/pregel_paper.pdf).
  • Trinity: developed by Microsoft; flexible in data and computation; supports online query processing and offline computation; runs on top of a well-connected cluster (memory cloud); based on TFS (similar to HDFS). Strength: low latency and high throughput (not at the same time).
  • Trinity: Data Model: key-value store; one table for nodes and one table for each type of relation; relations are represented by id pairs in the respective table; customisation is possible with the Trinity Specification Language (TSL); data is backed up to a persistent file system.
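
A minimal sketch of that table layout, assuming one key-value table for nodes and one id-pair table per relation type. The GraphStore struct and its neighbours helper are illustrative stand-ins for Trinity's distributed in-memory key-value store and a TSL-defined schema.

```cpp
// Minimal sketch of the table layout described above: one key-value table
// for nodes and one table of id pairs per relation type.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct GraphStore {
    // node table: id -> attributes
    std::map<int64_t, std::map<std::string, std::string>> nodes;
    // one relation table per relation type: type -> list of (source, target) id pairs
    std::map<std::string, std::vector<std::pair<int64_t, int64_t>>> relations;

    std::vector<int64_t> neighbours(const std::string& relation, int64_t source) const {
        std::vector<int64_t> result;
        auto it = relations.find(relation);
        if (it == relations.end()) return result;
        for (const auto& [src, dst] : it->second)
            if (src == source) result.push_back(dst);
        return result;
    }
};

int main() {
    GraphStore g;
    g.nodes[1] = {{"name", "alice"}};
    g.nodes[2] = {{"name", "bob"}};
    g.relations["friend"].push_back({1, 2});

    for (int64_t id : g.neighbours("friend", 1))
        std::cout << g.nodes[id]["name"] << '\n';   // prints "bob"
}
```
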
  • Trinity: API: Trinity Desktop Environment (TDE); supports query requests (similar to Horton/SQL); supports offline computation (similar to Pregel).
  • Trinity: Architecture: Slaves store a part of the data and process tasks and messages. Proxies form an optional middle tier between slaves and clients; they handle messages but do not store data. Clients are responsible for user interaction with the cluster.
  • Trinity: Architecture. Figure: Trinity cluster structure (https://research.microsoft.com/pubs/161291/trinity.pdf).
  • Trinity: Fault Tolerance and Performance: no ACID support, but atomicity of operations; dead machines are replaced by live ones, which reload their memory from TFS; a requesting machine waits until the dead machine has been replaced; the state of the most recent checkpoint superstep is recovered (similar to Pregel).
  • Trinity: Fault Tolerance and Performance. Figure: response time of subgraph match queries (https://research.microsoft.com/pubs/161291/trinity.pdf).
  • Unicorn: in-memory, social-graph-aware indexing system; backend of Facebook's search offering; based on Hadoop. Strengths: typeahead; good performance on complex queries.
  • Unicorn: Data Model: sharded data (similar to Facebook's TAO); indices are built and converted using a custom Hadoop pipeline.
  • Unicorn: API: queries in the Unicorn query language, e.g. (term likers:104076956295773) ≈ 6M likers of "Computer Science"; apply allows querying a (truncated) set of ids and then using those ids to construct a new query; extract attaches matches as metadata from the forward index of the result set.
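
To show how term and apply compose, here is a small sketch that evaluates them over an in-memory inverted index. The index contents and the exact term spellings are made-up illustrations; Unicorn's real operators work over sharded posting lists with per-hit metadata rather than plain id vectors.

```cpp
// Minimal sketch of the two operators on the slide: term(t) returns the
// posting list for t; apply(prefix, inner) takes the ids returned by the
// inner query, looks up "<prefix><id>" for each, and unions the results
// (friends-of-likers style).
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

using Ids = std::vector<long long>;
using Index = std::unordered_map<std::string, Ids>;

Ids term(const Index& index, const std::string& t) {
    auto it = index.find(t);
    return it == index.end() ? Ids{} : it->second;
}

// apply: rewrite each id of the inner result into a new term and union the hits
Ids apply(const Index& index, const std::string& prefix, const Ids& inner) {
    std::set<long long> merged;                       // de-duplicate across lists
    for (long long id : inner)
        for (long long hit : term(index, prefix + std::to_string(id)))
            merged.insert(hit);
    return Ids(merged.begin(), merged.end());
}

int main() {
    Index index = {
        {"likers:104076956295773", {11, 12}},         // people who like the page
        {"friend:11", {21, 22}},                      // their friends
        {"friend:12", {22, 23}},
    };
    // (apply friend: likers:104076956295773)  ~  friends of likers
    for (long long id : apply(index, "friend:", term(index, "likers:104076956295773")))
        std::cout << id << '\n';                      // prints 21 22 23
}
```
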
  • Unicorn: Architecture: the top-aggregator dispatches the query to one rack-aggregator per rack, then combines and returns the result; each rack-aggregator forwards the query to all index servers of its rack (high bandwidth) and combines their results; index servers (about 40-80 machines per rack) store the adjacency lists and perform the operations.
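
A toy sketch of the aggregation path described above: index servers return scored partial results, each rack-aggregator merges the lists from its rack, and the top-aggregator merges the rack results and keeps the best k hits. The Hit struct, the scores, and merge_top_k are illustrative assumptions about how such a merge could look.

```cpp
// Toy sketch of the aggregation path: the same top-k merge is used by the
// rack-aggregators (over index servers) and the top-aggregator (over racks).
#include <algorithm>
#include <iostream>
#include <vector>

struct Hit { long long id; double score; };

std::vector<Hit> merge_top_k(const std::vector<std::vector<Hit>>& partials, size_t k) {
    std::vector<Hit> all;
    for (const auto& p : partials) all.insert(all.end(), p.begin(), p.end());
    std::sort(all.begin(), all.end(),
              [](const Hit& a, const Hit& b) { return a.score > b.score; });
    if (all.size() > k) all.resize(k);
    return all;
}

int main() {
    // two "index servers" per rack, then two "racks" merged at the top
    auto rack1 = merge_top_k({{{1, 0.9}, {2, 0.4}}, {{3, 0.7}}}, 2);
    auto rack2 = merge_top_k({{{4, 0.8}}, {{5, 0.2}}}, 2);
    for (const Hit& h : merge_top_k({rack1, rack2}, 3))
        std::cout << h.id << " (" << h.score << ")\n";   // 1, 4, 3
}
```
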
  • Unicorn: Fault Tolerance and Performance: sharding and replication; machines are replaced automatically; serving incomplete results is strongly preferable to serving empty results.
  • Unicorn: Fault Tolerance and Performance: (apply friend: likers:104076956295773) ≈ friends of likers of "Computer Science" (https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf).
  • Comparison:

    Framework         TAO   Horton   Pregel   Trinity   Unicorn
    Query language    no    yes      no       yes       yes
    Low latency       yes   yes      no       yes       yes
    High throughput   no    no       yes      yes       no
  • Future Work: query language vs. a fixed set of queries; an all-in-one framework is difficult (Trinity is the best attempt).
  • Thank you for your attention. Questions? Sources:
    1. https://research.microsoft.com/pubs/161291/trinity.pdf
    2. http://research.microsoft.com/pubs/162643/icde12_demo_679.pdf
    3. http://kowshik.github.io/JPregel/pregel_paper.pdf
    4. https://www.facebook.com/download/273893712748848/atc13-bronson.pdf
    5. https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf