Analyzing and Linking Big Data with Stratosphere
Kostas Tzoumas
kostas.tzoumas@tu-berlin.de
www.user.tu-berlin.de/kostas.tzoumas/

Big Data
■ Driven by cheaper storage and computation
      □ Cloud computing further enabling economies of scale
      □ Open-source software lowers the barrier to entry
■ Societal and economic impact
      □ Scientific breakthroughs will come from data exploration
      □ Success in business dictated by the ability to quickly draw
        insights from data signals
■ Big Data Analytics
      □ Analytical workloads scale inversely with cost
■ Major challenge: Information marketplaces
      □ Data as a resource, analytics as a product
      □ An “AppStore” for data
  2                                      Linking and Analyzing Big Data with Stratosphere
The DOPA Vision
■ Data Pools as collections of diverse data sets
■ Example data pools
      □ Evolving history of the Web
      □ Financial and statistical data
■ Data identification
      □ By assigning unique IDs to objects
      □ Enables data linkage
■ Data Analytics
      □ Integration and analytics on diverse data sources
      □ Using a common framework and language



Some Scenarios
■ Sentiment and market analysis
   □ An SME producing consumer goods can analyze blogs and
     social network streams, and link them with customer
     demographics to perform sentiment analysis and market
     research
■ Green houses
   □ A home buyer can find out the energy consumption, and its
     distribution over time, for houses in a particular area by
     linking and analyzing energy data with demographic data
■ Traffic analysis, transportation and construction planning
   □ Linking weather, traffic, road data



Big Data Analytics
■ “Big Data” refers to different applications as well as
  more data
      □ Beyond traditional DW queries
■ Open-source in data management
      □ Enabled primarily by Hadoop popularity
      □ Changed research landscape
■ New systems are rethinking the complete data
  management stack at a massively parallel scale
      □ As systems mature, need to tackle hard and novel problems




StratoSphere
Above the Clouds




Stratosphere
■ Collaborative Research Project
      □ 3 Universities, 5 research groups in the Berlin area
■ Infrastructure for Big Data Analytics
■ Bridge relational DBMSs and MapReduce worlds
      □ Intersection of functional languages and data parallelism
      □ Re-architecting data management systems for massive
        parallelism
■ Open-source research platform (Apache)
■ Used by a variety of universities and research institutes



Stratosphere Architecture
■ Example Meteor query
      $res =
        filter $e in $emp
        where
          $e.income > 30000;
[Figure: architecture diagram. Labels: Data pools, Query Processor, Compiler, PACT Optimizer, Nephele]

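As a rough illustration of what the Meteor query on this slide asks for, here is a minimal Python sketch. The record layout (dictionaries with an `income` field), the sample data, and the function name `filter_by_income` are assumptions made for illustration only. This is not Stratosphere's API, and it says nothing about how the compiler, the PACT optimizer, or Nephele would actually execute the query.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def filter_by_income(emp: Iterable[Record], threshold: int = 30000) -> List[Record]:
    """Keep the records whose income exceeds the threshold.

    Mirrors the intent of: filter $e in $emp where $e.income > 30000;
    """
    return [e for e in emp if e["income"] > threshold]


# Hypothetical sample data, not taken from the slides.
emp = [
    {"name": "a", "income": 25000},
    {"name": "b", "income": 30000},
    {"name": "c", "income": 42000},
]

res = filter_by_income(emp)
```

Because the predicate inspects one record at a time, a filter like this can be evaluated independently on each partition of the input, which is what makes it easy to parallelize.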
Stratosphere Use Cases
[Figure: use-case illustration. Surviving labels: 10TB; 1100km; 950km; 2km resolution]

The Meteor Query Language
■ Stratosphere's declarative front end, inspired by IBM's
  Jaql language
■ Extensible and flexible
      □ Easy to add libraries, e.g., for data linkage, cleansing, mining
      □ Easy to integrate in language syntax
■ Provides operators for Information Extraction and Data
  Linkage
■ Time as a first-class concept




The Nephele Execution Engine
■ Executes Nephele schedules
      □ DAGs of already parallelized operator instances
      □ Parallelization already done by PACT optimizer
■ Design decisions
      □ Designed to run on top of an IaaS cloud
      □ Predictable performance
      □ Scalability to 1000+ nodes with flexible fault-tolerance
■ Supports network and in-memory channels (both pipelined)
  and file channels (materialized)




The PACT Programming Model
■ Internal Stratosphere programming model
      □ Also exposed to the programmer for advanced functionality
■ Dataflow, side-effect free programming model enabling
  massive parallelism
■ Centered around the concept of second-order functions
      □ Generalization of MapReduce

[Figure: illustrations of the Map PACT, Reduce PACT, Match PACT, and CoGroup PACT]

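The idea of second-order functions can be sketched in a few lines of Python. Each function below takes a user-defined first-order function (`udf`) plus one or two inputs of key/value pairs and decides how the records are grouped before `udf` is called. The function names, the key/value tuple representation, and the sample data are assumptions for illustration; this is a single-process sketch of the commonly described semantics of the four PACTs named on the slide, not the Stratosphere API.

```python
from collections import defaultdict


def _group(pairs):
    """Group (key, value) pairs into {key: [values]}, keeping first-seen key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_pact(udf, data):
    """Call udf once per record."""
    out = []
    for key, value in data:
        out.extend(udf(key, value))
    return out


def reduce_pact(udf, data):
    """Call udf once per key, with all values that share that key."""
    out = []
    for key, values in _group(data).items():
        out.extend(udf(key, values))
    return out


def match_pact(udf, left, right):
    """Call udf once per pair of records from both inputs with equal keys."""
    right_groups = _group(right)
    out = []
    for key, left_value in left:
        for right_value in right_groups.get(key, []):
            out.extend(udf(key, left_value, right_value))
    return out


def cogroup_pact(udf, left, right):
    """Call udf once per key, with that key's values from both inputs (either may be empty)."""
    left_groups, right_groups = _group(left), _group(right)
    keys = list(left_groups) + [k for k in right_groups if k not in left_groups]
    out = []
    for key in keys:
        out.extend(udf(key, left_groups.get(key, []), right_groups.get(key, [])))
    return out


# Hypothetical sample inputs keyed by a customer id.
orders = [(1, "book"), (1, "pen"), (2, "lamp")]
customers = [(1, "Ada"), (3, "Bob")]

upper = map_pact(lambda k, item: [(k, item.upper())], orders)
counts = reduce_pact(lambda k, items: [(k, len(items))], orders)
joined = match_pact(lambda k, item, name: [(name, item)], orders, customers)
both = cogroup_pact(lambda k, items, names: [(k, len(items), len(names))], orders, customers)
```

In this sketch the user code never mentions partitioning or shipping data. Only the second-order function fixes which records must be seen together, and that is the property a system can exploit to parallelize the program.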
Optimization
■ Knowledge of PACT signatures permits automatic
  optimization à la relational DBMSs
■ Emulates different hand-crafted MapReduce
  implementations
■ Enables orders-of-magnitude faster programs
■ Frees the programmer from thinking about execution

[Figure: two alternative execution plans for the same PACT program over inputs p (pid, r) and A (pid, tid, p). In both plans, Match (on pid) joins P and A and emits (tid, k=r*p), and Reduce (on tid) sums up the partial ranks and emits (pid=tid, r=∑ k).
  Left plan: p is broadcast and used for buildHT (pid); A goes through part./sort (tid) and probeHT (pid); Match feeds Reduce over a fifo connection.
  Right plan: p goes through partition (pid) and probeHT (pid); A goes through partition (pid) and buildHT (pid); Match feeds Reduce through part./sort (tid).]

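The two plans in the figure compute the same result with different data-shipping strategies. The Python sketch below imitates both in a single process to show that they are interchangeable. The function names, the partition count, and the sample ranks and transition values are assumptions for illustration, and the comments only loosely map steps to the figure labels; it is not how the PACT optimizer or Nephele is implemented.

```python
from collections import defaultdict


def plan_broadcast(p, A):
    """Left plan: hash table built on the (small) rank input p, probed with A."""
    ht = dict(p)  # buildHT (pid) on the broadcast input; assumes unique pids
    # Match (on pid): emit (tid, k = r * p) for each matching pair
    partial = [(tid, ht[pid] * prob) for pid, tid, prob in A if pid in ht]
    ranks = defaultdict(float)
    for tid, k in partial:  # Reduce (on tid): sum up partial ranks
        ranks[tid] += k
    return dict(ranks)


def plan_partitioned(p, A, n_parts=2):
    """Right plan: both inputs partitioned on pid, joined per partition, regrouped on tid."""
    p_parts = [[] for _ in range(n_parts)]
    A_parts = [[] for _ in range(n_parts)]
    for pid, r in p:  # partition (pid)
        p_parts[hash(pid) % n_parts].append((pid, r))
    for pid, tid, prob in A:  # partition (pid)
        A_parts[hash(pid) % n_parts].append((pid, tid, prob))

    partial = []
    for p_part, A_part in zip(p_parts, A_parts):
        ht = defaultdict(list)
        for pid, tid, prob in A_part:  # buildHT (pid) on A
            ht[pid].append((tid, prob))
        for pid, r in p_part:  # probeHT (pid) with p
            for tid, prob in ht.get(pid, []):
                partial.append((tid, r * prob))  # Match output (tid, k = r * p)

    ranks = defaultdict(float)
    for tid, k in sorted(partial):  # part./sort (tid), then Reduce (on tid)
        ranks[tid] += k
    return dict(ranks)


# Hypothetical inputs: p holds (pid, r), A holds (pid, tid, p).
p = [(1, 0.5), (2, 0.25), (3, 0.25)]
A = [(1, 2, 0.5), (1, 3, 0.5), (2, 3, 1.0), (3, 1, 1.0)]

ranks_left = plan_broadcast(p, A)
ranks_right = plan_partitioned(p, A)
```

Which plan is cheaper depends on the input sizes: broadcasting pays off when p is small relative to A, while partitioning both sides avoids replicating a large input. Making that choice automatically is the kind of decision the slide attributes to the optimizer.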
Ongoing Research
[Figure: two diagrams. Left: a generic dataflow with the labels Src 1, Src 2, UDF Map (twice), UDF Reduce, UDF Match, and Sink. Right: the rank-computation plan over p (pid, r) and A (pid, tid, p), with broadcast, buildHT (pid), part./sort (tid), probeHT (pid), Match (on pid) emitting (tid, k=r*p) to join P and A, fifo, and Reduce (on tid) emitting (pid=tid, r=∑ k) to sum up partial ranks, annotated with the labels I, O, and CACHE]

Summary
■ DOPA
      □ Bootstrapping the information economy by providing
        information marketplaces and related business models
      □ Brings together heterogeneous data pools
      □ Enables easy linkage and analytics across data pools via a
        flexible programming language
■ Stratosphere
      □ Technical infrastructure for scalable analytics
      □ Pushes the MapReduce paradigm forward
      □ Focal point of several research initiatives across Europe and
        the world


Acknowledgments
■ FP7 STREP (DOPA), DFG FOR, EIT (Stratosphere)
■ DOPA partners
      □ TU Berlin, IMR, DataMarket, OKKAM, Vico, ami
■ Stratosphere partners
      □ TU Berlin, HU Berlin, HPI
■ EIT partners
      □ TU Berlin, SICS, TU Delft, Inria, U. Trento, Aalto U., SZTAKI




Thank you!
         www.stratosphere.eu
          @stratosphere_eu
     kostas.tzoumas@tu-berlin.de



EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
European Data Forum
 
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
European Data Forum
 
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
European Data Forum
 
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
European Data Forum
 
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
European Data Forum
 
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
European Data Forum
 
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
European Data Forum
 

EDF2012 Kostas Tzoumas - Linking and analyzing big data - Stratosphere

  • 1. Analyzing and Linking Big Data with Stratosphere Kostas Tzoumas kostas.tzoumas@tu-berlin.de www.user.tu-berlin.de/kostas.tzoumas/
  • 2. Big Data ■ Driven by cheaper storage and computation □ Cloud computing further enabling economies of scale □ Open-source software lowers barrier of entry ■ Societal and economic impact □ Scientific breakthroughs will come from data exploration □ Success in business dictated by the ability to quickly draw insights from data signals ■ Big Data Analytics □ Analytical workloads scale inversely with cost ■ Major challenge: Information marketplaces □ Data as a resource, analytics as a product □ An “AppStore” for data
  • 3. The DOPA Vision ■ Data Pools as collections of diverse data sets ■ Example data pools □ Evolving history of the Web □ Financial and statistical data ■ Data identification □ By assigning unique IDs to objects □ Enables data linkage ■ Data Analytics □ Integration and analytics on diverse data sources □ Using a common framework and language
  • 4. Some Scenarios ■ Sentiment and market analysis □ An SME producing consumer goods can analyze blogs and social network streams, and link them with customer demographics to perform sentiment analysis and market research ■ Green houses □ A home buyer can find out the energy consumption and distribution over time in houses in a particular area by linking and analyzing energy with demographic data. ■ Traffic analysis, transportation and construction planning □ Linking weather, traffic, road data
  • 5. Big Data Analytics ■ “Big Data” refers to different applications as well as more data □ Beyond traditional DW queries ■ Open-source in data management □ Enabled primarily by Hadoop popularity □ Changed research landscape ■ New systems are rethinking the complete data management stack at a massively parallel scale □ As systems mature, need to tackle hard and novel problems
  • 6. StratoSphere Above the Clouds STRATOSPHERE
  • 7. Stratosphere ■ Collaborative Research Project □ 3 Universities, 5 research groups in the Berlin area ■ Infrastructure for Big Data Analytics ■ Bridge relational DBMSs and MapReduce worlds □ Intersection of functional languages and data parallelism □ Re-architecting data management systems for massive parallelism ■ Open-source research platform (Apache) ■ Used by a variety of Universities and research institutes
  • 8. Stratosphere Architecture [Figure: the query $res = filter $e in $emp where $e.income > 30000; and three data pools feed a stack of components: Compiler, Query Processor, PACT Optimizer, Nephele, ...]
  • 9. Stratosphere Use Cases [Figure: 10TB data set; 1100km, 950km, 2km resolution]
  • 10. The Meteor Query Language ■ Stratosphere declarative front-end inspired by the IBM Jaql language ■ Extensible and flexible □ Easy to add libraries, e.g., for data linkage, cleansing, mining □ Easy to integrate in language syntax ■ Provides operators for Information Extraction and Data Linkage ■ Time as a first-class concept
  • 11. The Nephele Execution Engine ■ Executes Nephele schedules □ DAGs of already parallelized operator instances □ Parallelization already done by PACT optimizer ■ Design decisions □ Designed to run on top of an IaaS cloud □ Predictable performance □ Scalability to 1000+ nodes with flexible fault-tolerance ■ Permits network, in-memory (both pipelined), file (materialization) channels
  • 12. The PACT Programming Model ■ Internal Stratosphere programming model □ Also exposed to the programmer for advanced functionality ■ Dataflow, side-effect free programming model enabling massive parallelism ■ Centered around the concept of second-order functions □ Generalization of MapReduce [Figure: Map PACT, Reduce PACT, Match PACT, CoGroup PACT]
  • 13. Optimization ■ Knowledge of PACT signature permits automatic optimization à la relational DBMSs ■ Emulates different hand-crafted MapReduce implementations ■ Enables orders of magnitude faster programs ■ Frees programmer from thinking about execution [Figure: two alternative execution plans for the same program. In both, Match (on pid) joins P (pid, r) and A (pid, tid, p), emitting (tid, k=r*p), and Reduce (on tid) sums up the partial ranks, emitting (pid=tid, r=∑k). Plan annotations include broadcast, partition (pid), part./sort (tid), fifo, buildHT (pid), and probeHT (pid).]
  • 14. Ongoing Research [Figure: dataflow from two sources (Src 1, Src 2) through Map UDFs, a Match UDF (on pid) that joins P (pid, r) and A (pid, tid, p) and emits (tid, k=r*p), and a Reduce UDF (on tid) that sums up partial ranks and emits (pid=tid, r=∑k), to a Sink. Annotations include CACHE, broadcast, fifo, buildHT (pid), probeHT (pid), and part./sort (tid).]
  • 15. Summary ■ DOPA □ Bootstrapping the information economy by providing information marketplaces and related business models □ Brings together heterogeneous data pools □ Enables easy linkage and analytics across data pools via a flexible programming language ■ Stratosphere □ Technical infrastructure for scalable analytics □ Pushes the MapReduce paradigm forward □ Focal point of several research initiatives across Europe and the world
  • 16. Acknowledgments ■ FP7 STREP (DOPA), DFG FOR, EIT (Stratosphere) ■ DOPA partners □ TU Berlin, IMR, DataMarket, OKKAM, Vico, ami ■ Stratosphere partners □ TU Berlin, HU Berlin, HPI ■ EIT partners □ TU Berlin, SICS, TU Delft, Inria, U. Trento, Aalto U., STACZKI
  • 17. Thank you! www.stratosphere.eu @stratosphere_eu kostas.tzoumas@tu-berlin.de
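
The second-order functions named on slide 12 (Map, Reduce, Match, CoGroup) and the rank-update program drawn on slides 13 and 14 can be sketched in a few lines of Python. This is a single-process illustration only, not the Stratosphere API: the function names (`pact_map`, `pact_reduce`, `pact_match`, `pact_cogroup`, `rank_step`), the tuple record layout, and the sample data are all assumptions made for the example. Each second-order function takes a user-defined function (UDF) and decides which records the UDF sees together; a real system would additionally choose a parallel strategy (broadcast, partition, sort) for each one.

```python
from collections import defaultdict


def _group(records, key):
    """Group records by key, keeping first-seen key order."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return groups


def pact_map(udf, records):
    """Map: call the UDF once per record, independently of all others."""
    out = []
    for rec in records:
        out.extend(udf(rec))
    return out


def pact_reduce(udf, records, key):
    """Reduce: call the UDF once per group of records sharing a key."""
    out = []
    for k, group in _group(records, key).items():
        out.extend(udf(k, group))
    return out


def pact_match(udf, left, right, left_key, right_key):
    """Match: call the UDF once per pair of records with equal keys."""
    table = _group(left, left_key)           # build a hash table on one input
    out = []
    for r in right:                          # probe it with the other input
        for l in table.get(right_key(r), []):
            out.extend(udf(l, r))
    return out


def pact_cogroup(udf, left, right, left_key, right_key):
    """CoGroup: call the UDF once per key with both groups for that key."""
    lg, rg = _group(left, left_key), _group(right, right_key)
    keys = list(dict.fromkeys(list(lg) + list(rg)))
    out = []
    for k in keys:
        out.extend(udf(k, lg.get(k, []), rg.get(k, [])))
    return out


def rank_step(ranks, edges):
    """One rank update: Match on pid, then Reduce on tid.

    ranks: (pid, r) records; edges: (pid, tid, p) records.
    """
    # Match (on pid): join P and A, emit (tid, k = r * p)
    partial = pact_match(
        lambda rank, edge: [(edge[1], rank[1] * edge[2])],
        ranks, edges,
        left_key=lambda rank: rank[0],
        right_key=lambda edge: edge[0],
    )
    # Reduce (on tid): sum up partial ranks, emit (pid = tid, r = sum of k)
    return pact_reduce(
        lambda tid, group: [(tid, sum(k for _, k in group))],
        partial,
        key=lambda rec: rec[0],
    )


# Hypothetical sample data: two pages and three weighted links.
sample_ranks = [(1, 0.5), (2, 0.5)]
sample_edges = [(1, 2, 1.0), (2, 1, 0.5), (2, 2, 0.5)]
new_ranks = rank_step(sample_ranks, sample_edges)

# The filter from slide 8, written as a Map whose UDF emits zero or one record.
sample_emp = [("ann", 42000), ("bob", 25000)]
high_income = pact_map(lambda e: [e] if e[1] > 30000 else [], sample_emp)
```

Because the UDFs are side-effect free and the second-order function fixes how records are grouped, the build/probe join inside `pact_match` could be swapped for a partition-and-sort strategy without changing the result, which is the kind of freedom the optimizer on slide 13 relies on.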