Pixels Camp 2017 - Stories from the trenches of building a data architecture

We live in a data-centric era. An enormous variety of services now rely on data, and behind those services are architectures that support the flow and processing of that data. BinaryEdge.io is no exception: our platform is backed by a data architecture that processes thousands of events per second, which we built and currently maintain ourselves. In this talk we review the parts that compose a data architecture and discuss which tools can be used at each step to arrive at a functional architecture. Note that the insights given are not based on theoretical documents or truckloads of years of experience, but on our own experience of building and maintaining a large-scale data infrastructure and architecture.

Published in: Technology

  1. 1. BinaryEdge.io: Be Ready. Be Safe. Be Secure. Stories from the Trenches of Building a Data Architecture. Florentino Bexiga, Data Engineer/Platform Developer, fb@binaryedge.io
  2. 2. WHO WE ARE AND WHAT WE DO: VNC RDP Files People Social Company registration internal external Phone Email Linked urls BGP AS Whois AS membership AS peer List of IPs Shared infrastructure Co-hosted sites Contact Geolocation Office locations Social networks Phone portscan dns Screenshots Web Services http https Users Apps Files Banners Image Classifier Vulnerabilities DATA POINTS metadata Photos Family & friends Behaviour Likes Topics Search News Forums Sub-reddits Domains AXFR MX records Webserver Framework Headers Cookies Certificate Configuration Authorities Entities OCR SW ip address url address SMB torrents peers torrent name category source hashes of files
  3. 3. AGENDA: 01 THE NEED OF A DATA ARCHITECTURE; 02 SIMPLE ARCHITECTURE OVERVIEW; 03 THE BASIC SURVIVAL KIT; 04 MESSAGE QUEUE; 05 STREAM PROCESSING; 06 BATCH PROCESSING; 07 DATABASES; 08 BONUS ROUND: MANAGEMENT; 09 ARCHITECTURE REVISITED; 10 CLOUD-BASED ARCHITECTURES
  4. 4. THE NEED OF A DATA ARCHITECTURE. Rules before building a data architecture: (1) think about what you need to do with the data; (2) there are no more rules. Typical list of needs: gather a lot of data coming from different places; process that data in (close to) real time; make data available in multiple formats; provide ways to easily process that data.
  5. 5. SIMPLE ARCHITECTURE OVERVIEW (diagram components): SENSOR (x3), MESSAGE QUEUE, STREAM PROCESSING, DATA SINK, FILE STORAGE, BATCH PROCESSING, DATABASES, APIs, PORTALS
  6. 6. THE BASIC SURVIVAL KIT: Apache Hadoop (MapReduce, HDFS, YARN). Why Apache Hadoop? Interoperability with many other tools; great community; gets the job done.
  7. 7. THE BASIC SURVIVAL KIT: Points of attention. YARN: available resources per node for processing; timeouts; heap, heap... HDFS: same as above; primary/secondary nodes (high availability).
  8. 8. MESSAGE QUEUE: Apache Kafka. Originally developed by LinkedIn; massively scalable publish/subscribe message queue; high throughput; low latency. Concepts: topics, consumers, consumer groups, partitions, replicas.
  9. 9. MESSAGE QUEUE: Points of attention: timeouts; message sizes; retention logs vs. cleanup interval. !!!! Also, do not, for the love of god, simply delete all the subdirectories in your "kafka-logs" directory; you will cry.
  10. 10. STREAM PROCESSING: Apache Spark vs. Apache Storm vs. Apache Flink
  11. 11. STREAM PROCESSING (Apache Spark). The good parts: very simple programming model and APIs; multilanguage support; DataFrame API; ML libraries; wide community; wide range of addons. Points of attention: mini-batch processing, not real stream; heavy resource footprint; prone to timeouts or memory errors; hard to fine-tune to get the right performance.
  12. 12. STREAM PROCESSING (Apache Storm). The good parts: stream processing; multilanguage support; works without much configuration effort; low resources configuration; wide community; lots of connectors and addons; great performance, like, "The Flash" great. Points of attention: slightly more complex programming model; some support for other languages.
  13. 13. STREAM PROCESSING (Apache Flink). The good parts: stream processing; multilanguage support; simple API (very similar to Spark); DataSet API; ML libraries; good handling of resources; low configuration/optimisation overhead. Buuuuut.....: does not have a wide community; does not have that many connectors and addons.
  14. 14. BATCH PROCESSING: The good parts. Apache Spark: multilanguage support; simple API; DataFrame API; ML libraries; wide community; wide range of addons. Apache Flink: multilanguage support; simple API (very similar); DataSet API; ML libraries.
  15. 15. BATCH PROCESSING: Points of attention. Apache Spark: heavy resource footprint; prone to timeouts or memory errors; hard to fine-tune to get the right performance. Apache Flink: fewer configuration problems; better handling of resources; not a big community; not many addons.
  16. 16. DATABASES. Before committing to a database: 01 Think about how you need to access the data. 02 Read 1 again. 03 Seriously, read 1 again. Select a database based on your needs, e.g.: hardcore read/write workload and not much advanced querying: HBase; heavy read/write workload and minimally dynamic querying: Cassandra; advanced text querying and not such a heavy read/write workload: something else.
  17. 17. BONUS ROUND: MANAGEMENT. Apache Ambari: provision a Hadoop cluster; manage a Hadoop cluster; monitor a Hadoop cluster. Ambari uses Hadoop ecosystem distributions such as Hortonworks and Cloudera.
  18. 18. ARCHITECTURE REVISITED (diagram components): SENSOR (x3), APACHE KAFKA, APACHE STORM, DATA SINK, APACHE HDFS, APACHE SPARK, APACHE HBASE/CASSANDRA, APIs, PORTALS
  19. 19. CLOUD BASED ARCHITECTURES. Pros: less configuration overhead; less maintenance overhead; easily scalable; reliable; return focus back to data and product. Cons: $$$$$$$$$$
  20. 20. CLOUD BASED ARCHITECTURES (diagram components): SENSOR (x3), GOOGLE PUBSUB, GOOGLE DATAFLOW, DATA SINK, GOOGLE CLOUD STORAGE, GOOGLE DATAPROC, GOOGLE BIGTABLE/BIGQUERY, APIs, PORTALS
  21. 21. CLOUD BASED ARCHITECTURES (diagram components): SENSOR (x3), AMAZON SIMPLE QUEUE SERVICE, AMAZON DATA PIPELINE, DATA SINK, AMAZON S3, AMAZON ELASTIC MAPREDUCE, AMAZON DYNAMODB/REDSHIFT, APIs, PORTALS
  22. 22. BE READY. BE SAFE. BE SECURE. BinaryEdge AG, Freigutstrasse 40, 8001 Zurich, Switzerland. info@binaryedge.io, www.binaryedge.io, +41 78 713 40 00. CONTINGENCY THREAT SAFE IRRELEVANT
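
The Kafka concepts listed on slide 8 (topics, partitions, consumer groups) can be illustrated with a small sketch. This is not code from the talk and it does not use a Kafka client: `partition_for` and `assign_partitions` are hypothetical helpers, the partition count is made up, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keyed messages. It only shows the idea that a given key always lands on the same partition, and that each partition is read by exactly one member of a consumer group.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_PARTITIONS = 4  # hypothetical partition count for a single topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition using a stable hash.

    CRC32 is an illustrative stand-in; it is not the hash Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(
    consumers: Iterable[str], num_partitions: int = NUM_PARTITIONS
) -> Dict[str, List[int]]:
    """Spread partitions round-robin over the members of one consumer group."""
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = defaultdict(list)
    for partition in range(num_partitions):
        owner = members[partition % len(members)]
        assignment[owner].append(partition)
    return dict(assignment)


if __name__ == "__main__":
    # The same key is always routed to the same partition.
    print(partition_for("10.0.0.1") == partition_for("10.0.0.1"))
    # Two consumers in one group split the four partitions between them.
    print(assign_partitions(["consumer-a", "consumer-b"]))
```

With more consumers than partitions, the extra consumers receive nothing in this sketch, which mirrors why the partition count caps the useful parallelism of a consumer group.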

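Slide 11's remark that Spark does mini-batch processing rather than real stream processing can be made concrete with a toy comparison. The sketch below is an illustration under stated assumptions, not how either framework is implemented: `process_per_event` and `process_mini_batches` are hypothetical names, the input is a plain Python list, and batches are cut by count for simplicity, whereas Spark Streaming cuts them by a time interval.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_per_event(events: Iterable[T], handle: Callable[[T], R]) -> Iterator[R]:
    """Per-event model: every event is handled as soon as it arrives."""
    for event in events:
        yield handle(event)


def process_mini_batches(
    events: Iterable[T], handle_batch: Callable[[List[T]], R], batch_size: int
) -> Iterator[R]:
    """Mini-batch model: events wait until a batch is full before being handled."""
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield handle_batch(batch)


if __name__ == "__main__":
    ports = [22, 80, 443, 3389, 5900]  # made-up sample events
    print(list(process_per_event(ports, lambda port: port < 1024)))
    print(list(process_mini_batches(ports, len, batch_size=2)))
```

In the mini-batch version the first event is not handled until the batch fills up, which is the added latency the slide points at.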