EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

1,044 views
957 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,044
On SlideShare
0
From Embeds
0
Number of Embeds
234
Actions
Shares
0
Downloads
10
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

  1. 1. Analyzing and Linking Big Data with Stratosphere Kostas Tzoumas kostas.tzoumas@tu-berlin.de www.user.tu-berlin.de/kostas.tzoumas/
  2. 2. Big Data■ Driven by cheaper storage and computation □ Cloud computing further enabling economies of scale □ Open-source software lowers barrier of entry■ Societal and economical impact □ Scientific breakthroughs will come from data exploration □ Success in business dictated by the ability to quickly draw insights from data signals■ Big Data Analytics □ Analytical workloads scale inversely with cost■ Major challenge: Information marketplaces □ Data as a resource, analytics as a product □ An “AppStore” for data 2 Linking and Analyzing Big Data with Stratosphere
  3. 3. The DOPA Vision■ Data Pools as collections of diverse data sets■ Example data pools □ Evolving history of the Web □ Financial and statistical data■ Data identification □ By assigning unique IDs to objects □ Enables data linkage■ Data Analytics □ Integration and analytics on diverse data sources □ Using a common framework and language 3 Linking and Analyzing Big Data with Stratosphere
  4. 4. Some Scenarios■ Sentiment and market analysis □ An SME producing consumer goods can analyze blogs and social network streams, and link them with customer demographics to perform sentiment analysis and market research■ Green houses □ A home buyer can find out the energy consumption and distribution over time in houses in a particular area by linking and analyzing energy with demographic data.■ Traffic analysis, transportation and construction planning □ Linking weather, traffic, road data Linking and Analyzing Big Data with Stratosphere
  5. 5. Big Data Analytics■ “Big Data” refers to different applications as well as more data □ Beyond traditional DW queries■ Open-source in data management □ Enabled primarily by Hadoop popularity □ Changed research landscape■ New systems are rethinking the complete data management stack at a massively parallel scale □ As systems mature, need to tackle hard and novel problems 5 Linking and Analyzing Big Data with Stratosphere
  6. 6. StratoSphereAbove the Clouds STRATOSPHERE6 Linking and Analyzing Big Data with Stratosphere
  7. 7. Stratosphere■ Collaborative Research Project □ 3 Universities, 5 research groups in the Berlin area■ Infrastructure for Big Data Analytics■ Bridge relational DBMSs and MapReduce worlds □ Intersection of functional languages and data parallelism □ Re-architecting data management systems for massive parallelism■ Open-source research platform (Apache)■ Used by a variety of Universities and research institutes 7 Linking and Analyzing Big Data with Stratosphere
  8. 8. Stratosphere Architecture $res = filter $e in $emp where Data Data Data $e.income > 30000; pool pool pool Compiler Query Processor PACT Optimizer Nephele ...8 Linking and Analyzing Big Data with Stratosphere
  9. 9. 2km resolution Stratosphere Use Cases 10TB1100km, 950km, 2km resolution 9 Linking and Analyzing Big Data with Stratosphere
  10. 10. The Meteor Query Language■ Stratosphere declarative front-end inspired by the IBM Jaql language■ Extensible and flexible □ Easy to add libraries, e.g., for data linkage, cleansing, mining □ Easy to integrate in language syntax■ Provides operators for Information Extraction and Data Linkage■ Time as a first-class concept 10 Linking and Analyzing Big Data with Stratosphere
  11. 11. The Nephele Execution Engine■ Executes Nephele schedules □ DAGs of already parallelized operator instances □ Parallelization already done by PACT optimizer■ Design decisions □ Designed to run on top of an IaaS cloud □ Predictable performance □ Scalability to 1000+ nodes with flexible fault-tolerance■ Permits network, in-memory (both pipelined), file (materialization) channels 11 Linking and Analyzing Big Data with Stratosphere
  12. 12. The PACT Programming Model■ Internal Stratosphere programming model □ Also exposed to the programmer for advanced functionality■ Dataflow, side-effect free programming model enabling massive parallelism■ Centered around the concept of second-order functions □ Generalization of MapReduce Map PACT Reduce PACT Match PACT CoGroup PACT 12 Linking and Analyzing Big Data with Stratosphere
  13. 13. Optimization■ Knowledge of PACT signature permits automatic optimization ala Relational DBMSs■ Emulates different hand-crafted MapReduce implementations■ Enables orders of Reduce (on tid) ↑(pid=tid, r=∑ k)↑ Sum up partial ranks Reduce (on tid) ↑(pid=tid, r=∑ k)↑ magnitude faster fifo part./sort (tid) programs Match (on pid) ↑(tid, k=r*p)↑ Join P and A Match (on pid) ↑(tid, k=r*p)↑■ Frees programmer from buildHT (pid) probeHT (pid) probeHT (pid) buildHT (pid) thinking about broadcast part./sort (tid) partition (pid) partition (pid) execution p A p A (pid, r ) (pid, tid, p) (pid, r ) (pid, tid, p) 13 Linking and Analyzing Big Data with Stratosphere
  14. 14. Ongoing Research Sink O Reduce (on tid) Sum up ↑(pid=tid, r=∑ k)↑ partial ranks UDF fifo Match Match (on pid) ↑(tid, k=r*p)↑ Join P and A UDF probeHT (pid) Reduce buildHT (pid) CACHE broadcast part./sort (tid) UDF UDF A (pid, tid, p) Map Map I fifo Src Src 1 2 p (pid, r )14 Linking and Analyzing Big Data with Stratosphere
  15. 15. Summary■ DOPA □ Bootstrapping the information economy by providing information marketplaces and related business models □ Brings together heterogeneous data pools □ Enables easy linkage and analytics across data pools via a flexible programming language■ Stratosphere □ Technical infrastructure for scalable analytics □ Pushes the MapReduce paradigm forward □ Focal point of several research initiatives across Europe and the world 15 Linking and Analyzing Big Data with Stratosphere
  16. 16. Acknowledgments■ FP7 STREP (DOPA), DFG FOR, EIT (Stratosphere)■ DOPA partners □ TU Berlin, IMR, DataMarket, OKKAM, Vico, ami■ Stratosphere partners □ TU Berlin, HU Berlin, HPI■ EIT partners □ TU Berlin, SICS, TU Delft, Inria, U. Trento, Aalto U., STACZKI 16 Linking and Analyzing Big Data with Stratosphere
  17. 17. Thank you! www.stratosphere.eu @stratosphere_eu kostas.tzoumas@tu-berlin.de17 Linking and Analyzing Big Data with Stratosphere

×