SlideShare a Scribd company logo
1 of 38
Hadoop and Graph Data Management:
   Challenges and Opportunities

                 Daniel Abadi
     Assistant Professor at Yale University
           Chief Scientist at Hadapt
            Tuesday, November 8th

      (Collaboration with Jiewen Huang)
Two Major Trends
 Hadoop
   Becoming de facto standard for large scale data
    processing
   Becoming more than just MapReduce
   Ecosystem growing rapidly --- lot’s of great tools
    around it
 Graph data is proliferating; huge value if analyzed
 and processed
     Social graphs
     Linked data
     Telecom
     Advertising
     Defense
Use Hadoop to Process Graph
Data?
 Of course!
 BUT:
   Little automatic optimization (non-declarative
    interface)
   It’s possible to do graphs with Hadoop:
     Really badly
     Suboptimally
     Really well
   Case in point: subgraph pattern-matching
   benchmark on three different Hadoop-centered
   solutions:
     ~1000s,
     ~100s,
     < ~10s
Case study: Linked Data
Example Linked Data
 Entities (or “resources”)
  are nodes in the graph
 Relationships between
  entities are labeled
  directed edges in the
  graph
 (Resources typically have
  unique URI identifies to
  allow for global
  references --- not shown
  in example to the right)
 Any resource can
  connect to any other
  resource
Graph Representation
 Linked data graph
  can be parsed
  into a series of
  vertex-entity-
  vertex triples
 First entity
  referred to as the
  subject; second
  as the object;
  edge connecting
  them as the
  predicate
 We will call these
  “RDF triples”
Querying Linked Data
 Linked Data is typically queried in SPARQL
 Processing a SPARQL query is basically
 equivalent to the subgraph pattern matching
 problem
Scalable SPARQL Processing
 Single-node options are abundant, e.g.,
     Sesame
     Jena
     RDF-3X
     3store
 Fewer options that can scale across multiple
 machines
   This is where Hadoop comes in!
     One cool solution: SHARD (presented by Kurt Rohloff at
      HadoopWorld 2010)
       Uses HDFS to store graph, MapReduce jobs for
        subgraph pattern matching
       Much nicer than a naïve Hadoop solution
Another Example: Twitter Data
                               @joe_hellerst
                                   ein

                      follow                      follow
                      s                           s
                                follow
                                s
           @daniel_ab                               @mikeolso
              adi                                      n
                                 follow
                                 s                                  retweete
retweete
d                                                           retweeted
           retweete            retweete          retweete   d
           d                   d                 d

     @hadoop_is_my_life        @super_hadooper        @hadoop_is_the answer
Example Query over Twitter
 Graph

Who has retweeted both @daniel_abadi and
@mikeolson?

      @daniel_aba
                               @mikeolson
          di



          retweeted          retweeted


                      ???
Slide From Kurt Rohloff Presentation at
         HadoopWorld 2010
Slide From Kurt
    Rohloff
Presentation at
 HadoopWorld
     2010
Issues With Hadoop Defaults
 Hadoop does not default to pipelined algorithms
 between MapReduce jobs
   SHARD algorithm matches each clause of SPARQL
   subgraph in a separate MapReduce job
     Need full barrier synchronization between jobs which is
      unnecessary
 Hadoop does not default to pipelined algorithms
 between Map and Reduce
   Each job performs a join by vertex, where the object
    of one triple is joined with the subject of another
    triple
   Joins work much faster if you can pipeline between
    Map and Reduce (e.g. pipelined hash join)
Issues with Hadoop Defaults
(cont.)
 Hadoop hash partitions data across nodes
   Data in graph is randomly partitioned across nodes
   Data close in the graph could be on a physically
    distant node
   Therefore each join requires a complete
    redistribution of the entire data set across the
    network (and there are many joins)
 Hadoop defaults to replicating all data 3 times
  with no preferences for what is replicated or
  where it is replicated to
 Hadoop defaults to using the Hadoop Distributed
  File System (HDFS) for storage, which is not
  optimized for graph data
All is not lost! Don’t throw away
Hadoop!
 All we have to do is change the defaults and add
 to it a little
System Architecture
Partitioning
 Graphs can be represented as vertex1-edge-
  vertex2 triples
 Hash partitioning by vertex1 is straightforward
 Great for queries like:

Query: Find the names of the strikers that play for FC Barcelona.

SELECT ?name
WHERE { ?player type        footballer      .
          ?player name      ?name           .
          ?player position striker          .
          ?player playsFor FC_Barcelona . }
The problem with hash partitioning
 …

Find football players playing for clubs in a populous region where he was born.
Graph Partitioning
 Data close together in the graph should be
  physically close together (preferably on the same
  machine)
 Subgraph pattern matching can be done without
  require huge amounts of communication via joins
  across the cluster
Graph Partitioning
Graph Partitioning




   Machine 1   Machine 2   Machine 3
Edge/Triple Placement
●   Minimizing data exchange
    ●   Allowing data overlap
●   N-hop guarantee
    ●   The extent of data overlap
    ●   If a vertex is assigned to a machine, any
        vertex that is within n-hop of this vertex is
        also stored in this machine
0 Hop of Machine 1




  Machine 1   Machine 2   Machine 3
1 Hop of Machine 1




  Machine 1   Machine 2   Machine 3
2 Hop of Machine 1




  Machine 1   Machine 2   Machine 3
0 Hop of Machine 3




  Machine 1   Machine 2   Machine 3
1 Hop of Machine 3




  Machine 1   Machine 2   Machine 3
2 Hop of Machine 3




  Machine 1   Machine 2   Machine 3
High Degree Vertexes
●   Problem: High-degree vertexes make the
    graph well-connected and difficult to
    partition
●   Solution: Ignore them during graph
    partitioning

●   Problem: High-degree vertexes cause
    data explosion with a n-hop guarantee
●   Solution: Selectively weaken the n-hop
    guarantee
Query Processing
  ●   Query execution is more efficient if
      pushed to optimized storage (RDF-
      stores)
      ●   Minimizing the number of Hadoop jobs
      ●   The larger the hop guarantee, the more
          work is done in RDF-stores
To Exchange, or not to Exchange?
  ●   Given a query and n-hop guarantee, is
      data exchange (Hadoop job) between
      nodes needed?
      ●   Choose the “center” of the query graph
      ●   Calculate the distance from the “center” to
          the furthest edge
      ●   If distance > n, data exchange is needed;
          not needed otherwise
Data Exchange

Find football players playing for clubs in a populous region where he was born.
Experimental Setup
●   20-machine cluster
●   Leigh University Benchmark (LUBM):
    270 million triples
●   Things to compare:
    ●   Single-node RDF-3X
    ●   SHARD
    ●   Graph partitioning (the proposed system)
    ●   Hash partitioning on subjects
Performance Comparison
Speedup
●   Better than linear speedup
Analysis
 Difference between fastest implementation and
 slowest implementation was a factor of 1340!
   Using Hadoop does not mean that performance is
   fixed
 More improvements are possible
   Experiments used MapReduce whenever data
   communication was necessary
     NextGen Hadoop allows other programming paradigms
     besides MR
      MPI is a good candidate

   Still need to fix the data pipelining problem
 Factor of 1340 possible via focusing on storage --
 - similar in theme to Hadapt
How this fits with Hadapt

                                        Full SQL interface, Map
                                         Reduce, and JDBC Connector
  Flexible Query Interface
    (Full SQL Support, MR, JDBC)
                                        10x-50x faster than Hadoop and
                                         Hive
                                          Queries go from hours to minutes,
                   Split Query             and minutes to seconds
   Hadoop          Execution
                    (Patent Pending)
                                        Analytics across structured and
                                         unstructured data in one
                                         platform
   Hadapt Storage Engine
        (Relational + HDFS)
                                        3.5 Patents Pending

                                        $9.5 Series A financing, lead by
                                         Norwest Venture Partners and
                                         Bessemer Venture Partners
Optimized Storage Matters
 HDFS appropriate for unstructured data
 Relational storage appropriate for relational data
 Graph storage appropriate for graph data
 Hadapt allows for pluggable storage inside
 Hadoop (amongst other things)




 Bottom line: Hadoop can be used for scalable
 graph processing, but it might need some
 Hadapting ;)

More Related Content

What's hot

Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyShital Kat
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsAjay Ohri
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)Takumi Asai
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsEdwin de Jonge
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataeSAT Publishing House
 
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)MIT College Of Engineering,Pune
 
The Exabyte Journey and DataBrew with CICD
The Exabyte Journey and DataBrew with CICDThe Exabyte Journey and DataBrew with CICD
The Exabyte Journey and DataBrew with CICDShu-Jeng Hsieh
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012Ted Dunning
 

What's hot (20)

Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
RHadoop
RHadoopRHadoop
RHadoop
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Apache Hadoop: DFS and Map Reduce
Apache Hadoop: DFS and Map ReduceApache Hadoop: DFS and Map Reduce
Apache Hadoop: DFS and Map Reduce
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big data
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
The Exabyte Journey and DataBrew with CICD
The Exabyte Journey and DataBrew with CICDThe Exabyte Journey and DataBrew with CICD
The Exabyte Journey and DataBrew with CICD
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012Drill lightning-london-big-data-10-01-2012
Drill lightning-london-big-data-10-01-2012
 
lec2_ref.pdf
lec2_ref.pdflec2_ref.pdf
lec2_ref.pdf
 

Similar to Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportunities - Daniel Abadi - Yale University & Hadapt

Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesDaniel Abadi
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 

Similar to Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportunities - Daniel Abadi - Yale University & Hadapt (20)

Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and Opportunities
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportunities - Daniel Abadi - Yale University & Hadapt

  • 1. Hadoop and Graph Data Management: Challenges and Opportunities Daniel Abadi Assistant Professor at Yale University Chief Scientist at Hadapt Tuesday, November 8th (Collaboration with Jiewen Huang)
  • 2. Two Major Trends  Hadoop  Becoming de facto standard for large scale data processing  Becoming more than just MapReduce  Ecosystem growing rapidly --- lot’s of great tools around it  Graph data is proliferating; huge value if analyzed and processed  Social graphs  Linked data  Telecom  Advertising  Defense
  • 3. Use Hadoop to Process Graph Data?  Of course!  BUT:  Little automatic optimization (non-declarative interface)  It’s possible to do graphs with Hadoop:  Really badly  Suboptimally  Really well  Case in point: subgraph pattern-matching benchmark on three different Hadoop-centered solutions:  ~1000s,  ~100s,  < ~10s
  • 5. Example Linked Data  Entities (or “resources”) are nodes in the graph  Relationships between entities are labeled directed edges in the graph  (Resources typically have unique URI identifies to allow for global references --- not shown in example to the right)  Any resource can connect to any other resource
  • 6. Graph Representation  Linked data graph can be parsed into a series of vertex-entity- vertex triples  First entity referred to as the subject; second as the object; edge connecting them as the predicate  We will call these “RDF triples”
  • 7. Querying Linked Data  Linked Data is typically queried in SPARQL  Processing a SPARQL query is basically equivalent to the subgraph pattern matching problem
  • 8. Scalable SPARQL Processing  Single-node options are abundant, e.g.,  Sesame  Jena  RDF-3X  3store  Fewer options that can scale across multiple machines  This is where Hadoop comes in!  One cool solution: SHARD (presented by Kurt Rohloff at HadoopWorld 2010)  Uses HDFS to store graph, MapReduce jobs for subgraph pattern matching  Much nicer than a naïve Hadoop solution
  • 9. Another Example: Twitter Data @joe_hellerst ein follow follow s s follow s @daniel_ab @mikeolso adi n follow s retweete retweete d retweeted retweete retweete retweete d d d d @hadoop_is_my_life @super_hadooper @hadoop_is_the answer
  • 10. Example Query over Twitter Graph Who has retweeted both @daniel_abadi and @mikeolson? @daniel_aba @mikeolson di retweeted retweeted ???
  • 11. Slide From Kurt Rohloff Presentation at HadoopWorld 2010
  • 12. Slide From Kurt Rohloff Presentation at HadoopWorld 2010
  • 13. Issues With Hadoop Defaults  Hadoop does not default to pipelined algorithms between MapReduce jobs  SHARD algorithm matches each clause of SPARQL subgraph in a separate MapReduce job  Need full barrier synchronization between jobs which is unnecessary  Hadoop does not default to pipelined algorithms between Map and Reduce  Each job performs a join by vertex, where the object of one triple is joined with the subject of another triple  Joins work much faster if you can pipeline between Map and Reduce (e.g. pipelined hash join)
  • 14. Issues with Hadoop Defaults (cont.)  Hadoop hash partitions data across nodes  Data in graph is randomly partitioned across nodes  Data close in the graph could be on a physically distant node  Therefore each join requires a complete redistribution of the entire data set across the network (and there are many joins)  Hadoop defaults to replicating all data 3 times with no preferences for what is replicated or where it is replicated to  Hadoop defaults to using the Hadoop Distributed File System (HDFS) for storage, which is not optimized for graph data
  • 15. All is not lost! Don’t throw away Hadoop!  All we have to do is change the defaults and add to it a little
  • 17. Partitioning  Graphs can be represented as vertex1-edge- vertex2 triples  Hash partitioning by vertex1 is straightforward  Great for queries like: Query: Find the names of the strikers that play for FC Barcelona. SELECT ?name WHERE { ?player type footballer . ?player name ?name . ?player position striker . ?player playsFor FC_Barcelona . }
  • 18. The problem with hash partitioning … Find football players playing for clubs in a populous region where he was born.
  • 19. Graph Partitioning  Data close together in the graph should be physically close together (preferably on the same machine)  Subgraph pattern matching can be done without require huge amounts of communication via joins across the cluster
  • 21. Graph Partitioning Machine 1 Machine 2 Machine 3
  • 22. Edge/Triple Placement ● Minimizing data exchange ● Allowing data overlap ● N-hop guarantee ● The extent of data overlap ● If a vertex is assigned to a machine, any vertex that is within n-hop of this vertex is also stored in this machine
  • 23. 0 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  • 24. 1 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  • 25. 2 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  • 26. 0 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  • 27. 1 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  • 28. 2 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  • 29. High Degree Vertexes ● Problem: High-degree vertexes make the graph well-connected and difficult to partition ● Solution: Ignore them during graph partitioning ● Problem: High-degree vertexes cause data explosion with a n-hop guarantee ● Solution: Selectively weaken the n-hop guarantee
  • 30. Query Processing ● Query execution is more efficient if pushed to optimized storage (RDF- stores) ● Minimizing the number of Hadoop jobs ● The larger the hop guarantee, the more work is done in RDF-stores
  • 31. To Exchange, or not to Exchange? ● Given a query and n-hop guarantee, is data exchange (Hadoop job) between nodes needed? ● Choose the “center” of the query graph ● Calculate the distance from the “center” to the furthest edge ● If distance > n, data exchange is needed; not needed otherwise
  • 32. Data Exchange Find football players playing for clubs in a populous region where he was born.
  • 33. Experimental Setup ● 20-machine cluster ● Leigh University Benchmark (LUBM): 270 million triples ● Things to compare: ● Single-node RDF-3X ● SHARD ● Graph partitioning (the proposed system) ● Hash partitioning on subjects
  • 35. Speedup ● Better than linear speedup
  • 36. Analysis  Difference between fastest implementation and slowest implementation was a factor of 1340!  Using Hadoop does not mean that performance is fixed  More improvements are possible  Experiments used MapReduce whenever data communication was necessary  NextGen Hadoop allows other programming paradigms besides MR  MPI is a good candidate  Still need to fix the data pipelining problem  Factor of 1340 possible via focusing on storage -- - similar in theme to Hadapt
  • 37. How this fits with Hadapt  Full SQL interface, Map Reduce, and JDBC Connector Flexible Query Interface (Full SQL Support, MR, JDBC)  10x-50x faster than Hadoop and Hive  Queries go from hours to minutes, Split Query and minutes to seconds Hadoop Execution (Patent Pending)  Analytics across structured and unstructured data in one platform Hadapt Storage Engine (Relational + HDFS)  3.5 Patents Pending  $9.5 Series A financing, lead by Norwest Venture Partners and Bessemer Venture Partners
  • 38. Optimized Storage Matters  HDFS appropriate for unstructured data  Relational storage appropriate for relational data  Graph storage appropriate for graph data  Hadapt allows for pluggable storage inside Hadoop (amongst other things)  Bottom line: Hadoop can be used for scalable graph processing, but it might need some Hadapting ;)