
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015


There are many frameworks that can offer real time on top of Hadoop. This talk will show you how to use Pivotal HAWQ and how easy it is to query your Hadoop data with SQL. Come and see the power and ease of use that can help you get started with the Hadoop ecosystem.


  1. Paris 2015
  2. • CTO for Zenika • In charge of BigData/NoSQL consulting/training • Trainer • Pleasant guy
  3. 1 Hadoop reminder • 2 HAWQ – SQL on Hadoop • 3 PXF – Accessing sources • 4 Demo – Tweets Analytics
  4. Hadoop Reminder
  5. Storage: • Semi-structured • Unstructured • Large files • Large amounts of data • Write once, read many | Process: • Processing large amounts of data in parallel • Commodity hardware • Derived from functional programming
  6. • Provides high-throughput access to data blocks • Provides a limited interface for managing the file system to allow it to scale • Creates multiple replicas of each data block • Distributes them on computers throughout the cluster to enable reliable and rapid data access
  7. [Diagram: one NameNode JVM and multiple DataNode JVMs] NameNode (single) • Manages the file-system content tree • Manages file & directory metadata • Manages DataNodes and the blocks they hold | DataNode (multiple) • Store & retrieve data blocks (64 MB/128 MB) • Report block usage to the NameNode
  8. [Diagram: blocks A–D replicated across six DataNodes] File: input/logfiles/2014-12-12.log (200 MB) requires 4 blocks (A, B, C, D) at the 64 MB block size, spread across data nodes • Stored on blocks: 11, 22, 44, 66 • Replicated on blocks: 33, 99, 55, 111 • Replicated again on blocks: 77, 88, 10, 20
  9. [Diagram: the same cluster with one DataNode marked FAILURE; the blocks it held remain available on the other replicas]
  10. • Performs distributed data processing using the MapReduce programming paradigm • Allows a user-defined map phase: a parallel, shared-nothing processing of the input • Aggregates the output of the map phase in a user-defined reduce phase
  11. [Diagram: one JobTracker JVM and multiple TaskTracker JVMs] JobTracker (single) • Launches and manages jobs | TaskTracker (multiple) • Run individual tasks (mappers/reducers) • Reside on the DataNodes
  12. HAWQ – SQL on Hadoop
  13. • HAdoop With Queries? • Implementation of PostgreSQL • Uses HDFS • Alternative to Hive querying • Supports ANSI SQL-92 and analytic extensions from SQL-2003 • Cost-based parallel query optimizer
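
To make the SQL-92/SQL-2003 support concrete, here is a minimal sketch of a window-function query as HAWQ's PostgreSQL dialect accepts it; the sales table and its columns are hypothetical, for illustration only:

    -- Hypothetical table, used only to illustrate the dialect
    CREATE TABLE sales (
        region TEXT,
        amount FLOAT8
    );

    -- SQL-2003 analytic extension: rank each sale within its region
    SELECT region,
           amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS region_rank
    FROM sales;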
  14. • ODP – standardizing the Hadoop ecosystem • ODP Core for building a versioned, packaged, tested set of Hadoop components • Developing a platform • Pivotal and Hortonworks alliance to simplify adoption • Joint engineering efforts • Support services • HAWQ open-sourced
  15. [Diagram: a Master node connected to several Worker nodes over the network]
  16. [Diagram: HAWQ architecture – a HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF) connected over a network interconnect to several HAWQ Segment Hosts (Query Executor, PXF) and to the HDFS NameNode]
  17. HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF) • Based on PostgreSQL • Handles SQL commands • Maintains the global system catalog • Contains no user data
  18. HAWQ Segment Host (Query Executor, PXF) • Based on PostgreSQL • Stateless • Processes its partition of the query • Manages communication with the NameNode • User/table data stored in HDFS files
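
Since each segment host holds a partition of the user data, how rows are spread across segments matters; in HAWQ, as in Greenplum, this is set per table with a DISTRIBUTED clause. A minimal sketch, with a hypothetical table:

    -- Rows are hash-distributed across segment hosts by order_id
    CREATE TABLE orders_demo (
        order_id    BIGINT,
        customer_id BIGINT,
        total       FLOAT8
    ) DISTRIBUTED BY (order_id);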
  19. [Diagram: the same HAWQ architecture, with clients submitting SQL to the Master over JDBC/ODBC]
  20. [Diagram: the same HAWQ architecture, annotated with an example parallel plan: Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, HashJoin, Seq Scan on lineitem, Hash, Seq Scan on orders, Hash, HashJoin, Seq Scan on customer, Hash, Broadcast Motion, Seq Scan on nation]
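
A plan like the one on this slide can be inspected with the usual PostgreSQL EXPLAIN; the Motion operators show where data moves between segments. A sketch, assuming TPC-H style lineitem and orders tables already exist in HAWQ:

    -- EXPLAIN prints the parallel plan, including the Gather, Redistribute
    -- and Broadcast Motions inserted by the cost-based optimizer
    EXPLAIN
    SELECT o.o_orderkey, SUM(l.l_extendedprice)
    FROM orders o
    JOIN lineitem l ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderkey;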
  21. [Diagram: the same plan slice replicated on every segment host: Scan Bars b • Scan Sells s • HashJoin b.name = s.bar • Filter b.city = 'SanFrancisco' • Project s.beer, s.price • Motion Redist(b.name) • Motion Gather]
  22.–24. [Diagrams: the same HAWQ architecture, repeated unchanged across three slides]
  25. [Chart: TPC-DS queries supported, scale 0–120: Pivotal HD HAWQ 111/111, versus 20/111 and 31/111 for the unnamed alternatives] http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
  26. PXF – Accessing sources
  27. • Allows access to Hadoop (HDFS files, HBase, Hive) as external tables • Allows joins between HAWQ (internal) & external tables • Integrates with third-party systems (Cassandra, Accumulo) • Provides an extensible framework API to enable custom development for other data sources (see the sketch below)
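
As a hedged sketch of the first two points, here is an external table over a Hive source joined with an internal HAWQ table; the host follows the demo's pivhdsne:50070 convention, while the table and column names are hypothetical:

    -- External table over a Hive table, read through PXF's Hive profile
    CREATE EXTERNAL TABLE ext_sales_hive (
        customer_id BIGINT,
        amount      FLOAT8
    )
    LOCATION ('pxf://pivhdsne:50070/sales?PROFILE=Hive')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- Join the external (Hive) rows with an internal HAWQ table
    SELECT c.name, SUM(s.amount)
    FROM customers c
    JOIN ext_sales_hive s ON s.customer_id = c.id
    GROUP BY c.name;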
  28. [Diagram: the PXF eXtension Framework connecting HAWQ to HDFS, HBase and Hive]
  29. Fragmenter • Gets the locations of the fragments of a table | Accessor • Understands and reads a fragment, returning records to the Resolver | Resolver • Converts the records into the SQL engine format | Analyzer • Provides source statistics to the query optimizer
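
Instead of a named profile, these plugin classes can be wired directly in the table's LOCATION string, which is how a custom source is plugged in; the com.example classes below are hypothetical placeholders:

    -- Hypothetical custom plugins referenced without a named profile
    CREATE EXTERNAL TABLE ext_custom_source (
        id   BIGINT,
        body TEXT
    )
    LOCATION ('pxf://pivhdsne:50070/data/custom?FRAGMENTER=com.example.MyFragmenter&ACCESSOR=com.example.MyAccessor&RESOLVER=com.example.MyResolver')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');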
  30. Demo – Tweets Analytics
  31. [Diagram: demo architecture – SpringXD (stream processing/scoring) streams tweets into HDFS on PHD (or any ODP Core-based Hadoop distribution); HAWQ (SQL on Hadoop) queries the store directly] https://github.com/spring-projects/spring-xd-samples/tree/master/analytics-dashboard
  32. [Diagram: HDFS accessed through the eXtension Framework with the Json-ext plugin] http://pivotal-field-engineering.github.io/pxf-field/json.html
  33. stream create tweets --definition "twitterstream | hdfs --idleTimeout=3000 --fileExtension=json"
      stream create tweetlang --definition "tap:stream:tweets > field-value-counter --fieldName=lang" --deploy
      stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy
      stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy
      stream deploy tweets
  34. <profile>
        <name>JSON</name>
        <description>A profile for JSON data, one JSON record per line</description>
        <plugins>
          <fragmenter>com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
          <accessor>com.pivotal.pxf.plugins.json.JsonAccessor</accessor>
          <resolver>com.pivotal.pxf.plugins.json.JsonResolver</resolver>
          <analyzer>com.pivotal.pxf.plugins.hdfs.HdfsAnalyzer</analyzer>
        </plugins>
      </profile>
  35. CREATE EXTERNAL TABLE ext_tweets_json (
        created_at TEXT,
        id_str TEXT,
        text TEXT,
        source TEXT,
        "user.id" INTEGER,
        "user.location" TEXT,
        "coordinates.coordinates[0]" DOUBLE PRECISION,
        "coordinates.coordinates[1]" DOUBLE PRECISION)
      LOCATION ('pxf://pivhdsne:50070/xd/tweets/*.json?PROFILE=JSON')
      FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
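
With the external table in place, the JSON files SpringXD writes to HDFS can be queried like any other table; the aggregation below is an illustrative example of a demo-style query, not taken from the slides:

    -- Top tweet locations, computed directly over the raw JSON on HDFS
    SELECT "user.location", COUNT(*) AS tweets
    FROM ext_tweets_json
    WHERE "user.location" IS NOT NULL
    GROUP BY "user.location"
    ORDER BY tweets DESC
    LIMIT 10;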
