
FIWARE Tech Summit - FIWARE Cygnus and STH-Comet


Presentation by Joaquín Salvachúa
IT Professor, UPM

FIWARE Tech Summit
28-29 November, 2017
Malaga, Spain


  1. FIWARE Big Data ecosystem: Cygnus and STH-Comet
     Joaquin Salvachua, Andres Muñoz
     Universidad Politécnica de Madrid
     Joaquin.salvachua@upm.es, @jsalvachua, @FIWARE
     www.slideshare.net/jsalvachua
  2. (image-only slide)
  3. Big Data analytics
  4. Batch Processing
  5. Lambda Architecture
  6. FIWARE Architecture
  7. Cygnus
     • Persistence (collecting, aggregating and moving data) for later batch processing.
     • Can be integrated into a lambda architecture.
     • Quite flexible and configurable: based on stream data flows with a pub/sub-like communication model.
  8. CYGNUS
     • What is it for?
       – Cygnus is a connector in charge of persisting Orion context data in certain configured third-party storages, creating a historical view of such data. In other words, Orion only stores the last value of an entity's attribute; if older values are required, you have to persist them in another storage, value by value, using Cygnus.
     • How does it receive context data from Orion Context Broker?
       – Cygnus uses the subscription/notification feature of Orion. A subscription is made in Orion on behalf of Cygnus, detailing which entities we want to be notified about when an update occurs on any of those entities' attributes.
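As a sketch, a subscription like the one described above could look as follows, assuming an NGSIv2-capable Orion and a Cygnus HTTP source listening on its default port 5050 at /notify. The entity type, attribute names and host names are illustrative, not taken from the slides.

```python
import json

# Hypothetical NGSIv2 subscription telling Orion to notify Cygnus
# whenever the "temperature" attribute of any Room entity changes.
subscription = {
    "description": "Notify Cygnus of Room temperature changes",
    "subject": {
        "entities": [{"idPattern": ".*", "type": "Room"}],
        "condition": {"attrs": ["temperature"]},
    },
    "notification": {
        # Cygnus's HTTP source (default port 5050, path /notify)
        "http": {"url": "http://cygnus-host:5050/notify"},
        "attrs": ["temperature"],
    },
}

# This document would be POSTed to Orion, e.g.:
#   curl -X POST http://orion-host:1026/v2/subscriptions \
#        -H 'Content-Type: application/json' -d @subscription.json
print(json.dumps(subscription, indent=2))
```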
  9. (image-only slide)
  10. Cygnus
      • Cygnus is a connector in charge of persisting certain sources of data in certain configured third-party storages, creating a historical view of such data.
      • Internally, Cygnus is based on Apache Flume and runs data collection and persistence agents.
        – An agent is basically composed of a listener or source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.
  11. Cygnus Architecture
      • Cygnus runs Flume agents. Thus, the Cygnus agent architecture is the Flume agent architecture.
  12. Data Sinks
      • NGSI-like context data in:
        – HDFS, the Hadoop distributed file system.
        – MySQL, the well-known relational database manager.
        – CKAN, an Open Data platform.
        – MongoDB, the NoSQL document-oriented database.
        – STH Comet, a Short-Term Historic database built on top of MongoDB.
        – Kafka, the publish-subscribe messaging broker.
        – DynamoDB, a cloud-based NoSQL database by Amazon Web Services.
        – PostgreSQL, the well-known relational database manager.
        – Carto, the database specialized in geolocated data.
      • Twitter data in:
        – HDFS, the Hadoop distributed file system.
  13. Cygnus events
      • A Source consumes Events having a specific format, and those Events are delivered to the Source by an external source like a web server. For example, an AvroSource can be used to receive Avro Events from clients or from other Flume agents in the flow.
      • When a Source receives an Event, it stores it into one or more Channels. The Channel is a passive store that holds the Event until that Event is consumed by a Sink. One type of Channel available in Flume is the FileChannel, which uses the local filesystem as its backing store.
      • A Sink is responsible for removing an Event from the Channel and putting it into an external repository like HDFS (in the case of an HDFSEventSink) or forwarding it to the Source at the next hop of the flow. The Source and Sink within the given agent run asynchronously with the Events staged in the Channel.
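The source → channel → sink lifecycle above can be sketched as a minimal in-memory simulation. This is an illustration of the data flow only, not Flume's actual API; the class and variable names are ours, and a plain list stands in for a third-party store such as HDFS.

```python
from collections import deque

class Channel:
    """Passive store holding events until a sink consumes them."""
    def __init__(self):
        self._events = deque()
    def put(self, event):
        self._events.append(event)
    def take(self):
        return self._events.popleft() if self._events else None

class Source:
    """Receives external data and stages it in one or more channels."""
    def __init__(self, channels):
        self.channels = channels
    def receive(self, body):
        event = {"headers": {}, "body": body}  # a Flume event: headers + body
        for channel in self.channels:
            channel.put(event)

class Sink:
    """Drains events from its channel and writes them to a store."""
    def __init__(self, channel, storage):
        self.channel = channel
        self.storage = storage
    def drain(self):
        while (event := self.channel.take()) is not None:
            self.storage.append(event["body"])

storage = []
channel = Channel()
Source([channel]).receive("Room1 temperature=26.5")
Sink(channel, storage).drain()
print(storage)  # → ['Room1 temperature=26.5']
```

In real Flume the source and sink run asynchronously in separate threads; here the sink is drained explicitly for clarity.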
  14. Cygnus Configuration examples
      • https://github.com/telefonicaid/fiware-cygnus/blob/master/doc/cygnus-ngsi/installation_and_administration_guide/configuration_examples.md
  15. Multiple persistence backends
  16. Multiple Agents
      • One instance for each Agent.
      • This adds more capacity to the system.
  17. Connecting Orion Context Broker and Cygnus
      • Cygnus takes advantage of the subscription-notification mechanism of Orion Context Broker. Specifically, Cygnus needs to be notified each time certain entities' attributes change, and in order to do that, Cygnus must subscribe to those entities' attribute changes.
  18. Default Sinks
  19. (image-only slide)
  20. (image-only slide)
  21. Basic Cygnus agent
  22. Configure a basic Cygnus agent
      • Edit /usr/cygnus/conf/agent_<id>.conf
      • List of sources, channels and sinks:
        cygnusagent.sources = http-source
        cygnusagent.sinks = hdfs-sink
        cygnusagent.channels = hdfs-channel
      • Channels configuration:
        cygnusagent.channels.hdfs-channel.type = memory
        cygnusagent.channels.hdfs-channel.capacity = 1000
        cygnusagent.channels.hdfs-channel.transactionCapacity = 100
  23. Configure a basic Cygnus agent
      • Sources configuration:
        cygnusagent.sources.http-source.channels = hdfs-channel
        cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
        cygnusagent.sources.http-source.port = 5050
        cygnusagent.sources.http-source.handler = es.tid.fiware.fiwareconnectors.cygnus.handlers.OrionRestHandler
        cygnusagent.sources.http-source.handler.notification_target = /notify
        cygnusagent.sources.http-source.handler.default_service = def_serv
        cygnusagent.sources.http-source.handler.default_service_path = def_servpath
        cygnusagent.sources.http-source.handler.events_ttl = 10
        cygnusagent.sources.http-source.interceptors = ts de
        cygnusagent.sources.http-source.interceptors.ts.type = timestamp
        cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
        cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
  24. Configure a basic Cygnus agent
      • Sinks configuration:
        cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
        cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
        cygnusagent.sinks.hdfs-sink.cosmos_host = cosmos.lab.fi-ware.org
        cygnusagent.sinks.hdfs-sink.cosmos_port = 14000
        cygnusagent.sinks.hdfs-sink.cosmos_default_username = cosmos_username
        cygnusagent.sinks.hdfs-sink.cosmos_default_password = xxxxxxxxxxxxx
        cygnusagent.sinks.hdfs-sink.hdfs_api = httpfs
        cygnusagent.sinks.hdfs-sink.attr_persistence = column
        cygnusagent.sinks.hdfs-sink.hive_host = cosmos.lab.fi-ware.org
        cygnusagent.sinks.hdfs-sink.hive_port = 10000
        cygnusagent.sinks.hdfs-sink.krb5_auth = false
  25. HDFS details regarding Cygnus persistence
      • By default, for each entity Cygnus stores the data at:
        /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
      • Within each HDFS file, the data format may be json-row or json-column:
        – json-row:
          { "recvTimeTs": "13453464536", "recvTime": "2014-02-27T14:46:21", "entityId": "Room1", "entityType": "Room", "attrName": "temperature", "attrType": "centigrade", "attrValue": "26.5", "attrMd": [ ... ] }
        – json-column:
          { "recvTime": "2014-02-27T14:46:21", "temperature": "26.5", "temperature_md": [ ... ], "pressure": "90", "pressure_md": [ ... ] }
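The two formats can be contrasted with the slide's own values: json-row yields one record per attribute value, json-column one record per notification with a field per attribute. Metadata arrays are left empty here for brevity.

```python
import json

# json-row: one record per attribute value (values from the slide above).
row = json.loads("""{
    "recvTimeTs": "13453464536",
    "recvTime": "2014-02-27T14:46:21",
    "entityId": "Room1",
    "entityType": "Room",
    "attrName": "temperature",
    "attrType": "centigrade",
    "attrValue": "26.5",
    "attrMd": []
}""")

# json-column: one record per notification, one field per attribute.
column = json.loads("""{
    "recvTime": "2014-02-27T14:46:21",
    "temperature": "26.5",
    "temperature_md": [],
    "pressure": "90",
    "pressure_md": []
}""")

# The same temperature reading appears under attrValue in json-row and
# under the attribute-named field in json-column.
assert row["attrValue"] == column[row["attrName"]]
```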
  26. High Availability
      • Simple configuration:
        – Implementing HA for Flume/Cygnus is as easy as running two instances of the software and putting a load balancer between them and the data source (or sources).
      • Use File Channels (extra persistence) instead of the default Memory Channels.
      • Advanced configuration:
        – Flume with Zookeeper
      • https://github.com/telefonicaid/fiware-cygnus/blob/master/doc/cygnus-ngsi/installation_and_administration_guide/reliability.md
  27. STH-Comet
  28. (image-only slide)
  29. (image-only slide)
  30. (image-only slide)
  31. (image-only slide)
  32. (image-only slide)
  33. Architecture
  34. Data schemas and pre-aggregation
      • Although the STH stores the evolution of (raw) data (i.e., attribute values) in time, its real power comes from the storage of aggregated data.
      • The STH should be able to respond to queries such as:
        – Give me the maximum temperature of this room during the last month (range) aggregated by day (resolution).
        – Give me the mean temperature of this room today (range) aggregated by hour or even minute (resolution).
        – Give me the standard deviation of the temperature of this room this last year (range) aggregated by day (resolution).
        – Give me the number of times the air conditioner of this room was switched on or off last Monday (range) aggregated by hour (resolution).
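The first example query above ("maximum temperature during the last month, aggregated by day") could be expressed against the STH REST API roughly as follows. The host, port, entity id/type and dates are illustrative; aggrMethod and aggrPeriod are the STH query parameters selecting the aggregation function (range) and the resolution.

```python
from urllib.parse import urlencode

# Hypothetical query: max temperature of Room1 over October 2017,
# aggregated by day. Dates and host names are made-up examples.
params = {
    "aggrMethod": "max",    # aggregation function (the "range" question)
    "aggrPeriod": "day",    # resolution of the aggregation
    "dateFrom": "2017-10-01T00:00:00.000Z",
    "dateTo": "2017-10-31T23:59:59.999Z",
}
url = ("http://sth-host:8666/STH/v1/contextEntities"
       "/type/Room/id/Room1/attributes/temperature?" + urlencode(params))
print(url)
```

An HTTP GET on this URL (with the appropriate Fiware-Service headers) would return the pre-aggregated series instead of the raw values.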
  35. Data schemas and pre-aggregation
  36. API: get raw data
  37. Pagination
  38. Response
  39. Aggregated data retrieval
  40. Response
  41. Attribute data removal
  42. Log level retrieval & update
  43. Configuration
  44. Configuration: environment variables
  45. Configuration: environment variables
  46. Configuration: environment variables
  47. Usage and installation
      • Installation:
        git clone https://github.com/ging/fiware-sth-comet
        npm install
      • Docker:
        docker pull fiware/sth-comet
        docker run -t -i fiware/sth-comet
      • Running:
        fiware-sth-comet> ./bin/sth
  48. FIWARE Architecture
  49. Any Questions
  50. Extra documentation
      • The per agent Quick Start Guide found at readthedocs.org provides a good documentation summary (cygnus-ngsi, cygnus-twitter).
      • Nevertheless, both the Installation and Administration Guide and the User and Programmer Guide for each agent, also found at readthedocs.org, cover more advanced topics.
      • The per agent Flume Extensions Catalogue completes the available documentation for Cygnus (cygnus-ngsi, cygnus-twitter).
      • Other interesting links are:
        – Our Apiary documentation if you want to know how to use our API methods for Cygnus.
        – cygnus-ngsi integration examples.
        – cygnus-ngsi introductory course in FIWARE Academy.
  51. Round Robin channel selection
      • It is possible to configure more than one channel-sink pair for each storage in order to increase performance.
      • A custom ChannelSelector is needed.
      • https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md
  52. RoundRobinChannelSelector configuration
        cygnusagent.sources = mysource
        cygnusagent.sinks = mysink1 mysink2 mysink3
        cygnusagent.channels = mychannel1 mychannel2 mychannel3
        cygnusagent.sources.mysource.type = ...
        cygnusagent.sources.mysource.channels = mychannel1 mychannel2 mychannel3
        cygnusagent.sources.mysource.selector.type = es.tid.fiware.fiwareconnectors.cygnus.channelselectors.RoundRobinChannelSelector
        cygnusagent.sources.mysource.selector.storages = N
        cygnusagent.sources.mysource.selector.storages.storage1 = <subset_of_cygnusagent.sources.mysource.channels>
        ...
        cygnusagent.sources.mysource.selector.storages.storageN = <subset_of_cygnusagent.sources.mysource.channels>
  53. Pattern-based Context Data Grouping
      • The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a concatenation:
        destination=<entity_id>-<entityType>
      • It is possible to group different context data thanks to this regex-based feature implemented as a Flume interceptor:
        cygnusagent.sources.http-source.interceptors = ts de
        cygnusagent.sources.http-source.interceptors.ts.type = timestamp
        cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
        cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
  54. Matching table for pattern-based grouping
      • CSV file ('|' field separator) containing rules:
        <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
      • For instance:
        1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
        2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
        3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
        4|entityType|Room|other_rooms|rooms
      • https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
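A rule from the table above can be evaluated roughly as follows: concatenate the listed fields of the incoming context data, test the regex against the result and, on a match, route the data to the given destination. The evaluation logic and variable names here are a sketch of ours, not Cygnus's actual implementation; the example context data is made up.

```python
import re

# Rule 1 from the slide: match numeric room ids and group them
# under the "numeric_rooms" destination in the "rooms" dataset.
rule = "1|entityId,entityType|Room\\.(\\d*)Room|numeric_rooms|rooms"
rule_id, fields, regex, destination, dataset = rule.split("|")

# Hypothetical notified context data.
context = {"entityId": "Room.12", "entityType": "Room"}

# Concatenate the fields named in the rule, in order.
concatenated = "".join(context[f] for f in fields.split(","))  # "Room.12Room"

if re.match(regex, concatenated):
    print(f"route to {dataset}/{destination}")  # → route to rooms/numeric_rooms
```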
  55. Kerberos authentication
      • HDFS may be secured with Kerberos for authentication purposes.
      • Cygnus is able to persist on kerberized HDFS if the configured HDFS user has a registered Kerberos principal and this configuration is added:
        cygnusagent.sinks.hdfs-sink.krb5_auth = true
        cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
        cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
        cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
        cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf
      • https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md
  56. Thank you!
      http://fiware.org
      Follow @FIWARE on Twitter
  57. FIWARE Big Data ecosystem: Cygnus and STH-Comet
      Joaquin Salvachua, Andres Muñoz
      Universidad Politécnica de Madrid (UPM)
      Joaquin.salvachua@upm.es, @jsalvachua, @FIWARE
      www.slideshare.net/jsalvachua
