Pixels Camp 2017 - Stories from the trenches of building a data architecture

We live in a data-centric era. Nowadays we have at our disposal an enormous variety of services using data. Behind those services are architectures that support the flow and processing of that data. BinaryEdge.io is no exception: supporting our platform is a data architecture that processes thousands of events per second, built and currently maintained by us. In this talk we review the parts that compose a data architecture and discuss which tools can be used at each step to arrive at a functional architecture. Note that the insights given are not based on theoretical documents or truckloads of years of experience, but on our own experience of building and maintaining a large-scale data infrastructure and architecture.

  1. BinaryEdge.io: Be Ready. Be Safe. Be Secure. Stories from the Trenches of Building a Data Architecture. Florentino Bexiga, Data Engineer/Platform Developer, fb@binaryedge.io
  2. WHO WE ARE AND WHAT WE DO: VNC RDP Files People Social Company registration internal external Phone Email Linked urls BGP AS Whois AS membership AS peer List of IPs Shared infrastructure Co-hosted sites Contact Geolocation Office locations Social networks Phone portscan dns Screenshots Web Services http https Users Apps Files Banners Image Classifier Vulnerabilities DATA POINTS metadata Photos Family & friends Behaviour Likes Topics Search News Forums Sub-reddits Domains AXFR MX records Webserver Framework Headers Cookies Certificate Configuration Authorities Entities OCR SW ip address url address SMB torrents peers torrent name category source hashes of files
  3. AGENDA: 01 THE NEED OF A DATA ARCHITECTURE; 02 SIMPLE ARCHITECTURE OVERVIEW; 03 THE BASIC SURVIVAL KIT; 04 MESSAGE QUEUE; 05 STREAM PROCESSING; 06 BATCH PROCESSING; 07 DATABASES; 08 BONUS ROUND: MANAGEMENT; 09 ARCHITECTURE REVISITED; 10 CLOUD-BASED ARCHITECTURES
  4. THE NEED OF A DATA ARCHITECTURE. Rules before building a data architecture: think about what you need to do with the data; there are no more rules. Typical list of needs: gather a lot of data coming from different places; process that data in (close to) real time; make data available in multiple formats; provide ways to easily process that data.
  5. SIMPLE ARCHITECTURE OVERVIEW (diagram labels): SENSOR, SENSOR, SENSOR; MESSAGE QUEUE; STREAM PROCESSING; DATA SINK; FILE STORAGE; BATCH PROCESSING; DATABASES; APIs; PORTALS
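The transcript keeps only the labels of the slide 5 diagram, not its arrows. As a rough illustration, here is a minimal in-memory sketch of one plausible reading: sensors publish to a message queue, a stream step reacts to each event, raw events land in file storage, and a batch step aggregates them into a database. The function name, the event fields, and the port-based rule are all hypothetical and do not come from the talk.

```python
from collections import defaultdict


def run_pipeline(sensor_events):
    """Toy model: queue -> stream step + file storage -> batch step -> database."""
    message_queue = list(sensor_events)  # sensors publish events into the queue
    file_storage = []                    # raw copy of every event, kept for batch jobs
    alerts = []                          # output of the stream processing step

    for event in message_queue:
        file_storage.append(event)       # persist the raw event
        if event["port"] == 5900:        # hypothetical per-event rule (5900 is the VNC port)
            alerts.append(event["ip"])

    # Batch step: aggregate everything stored so far into a "database" table.
    database = defaultdict(int)
    for event in file_storage:
        database[event["port"]] += 1
    return alerts, dict(database)


events = [
    {"ip": "10.0.0.1", "port": 5900},
    {"ip": "10.0.0.2", "port": 3389},
    {"ip": "10.0.0.3", "port": 5900},
]
alerts, table = run_pipeline(events)
print(alerts, table)
```

In a real deployment each stage would be a separate system, as slide 18 shows; the sketch only illustrates how responsibilities are split between the per-event path and the batch path.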
  6. THE BASIC SURVIVAL KIT. Apache Hadoop: MapReduce, HDFS, YARN. Why Apache Hadoop? Interoperability with many other tools; great community; gets the job done.
  7. THE BASIC SURVIVAL KIT: points of attention. YARN: available resources per node for processing; timeouts; heap, heap... HDFS: same as above; primary/secondary nodes (high availability).
  8. MESSAGE QUEUE. Apache Kafka: originally developed by LinkedIn; massively scalable publish/subscribe message queue; high throughput; low latency. Concepts: topics, consumers, consumer groups, partitions, replicas.
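To make the concepts on slide 8 a little more concrete, here is a small pure-Python sketch (it does not use a Kafka client) of how keyed messages map to partitions and how the partitions of a topic can be shared among the members of one consumer group. The function names are invented for illustration. CRC32 stands in for the hash; Kafka's default partitioner uses murmur2 for keyed messages, and real group assignment is negotiated by the brokers and clients rather than computed like this.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition so that equal keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of a single group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(partition_for("198.51.100.7", 6))
print(assign_partitions(6, ["consumer-a", "consumer-b"]))
```

Within a group each partition has exactly one owner, which is why adding consumers beyond the partition count does not add parallelism.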
  9. MESSAGE QUEUE: points of attention. Timeouts; message sizes; retention logs vs. cleanup interval !!!! Also, do not, for the love of god, simply delete all the subdirectories in your "kafka-logs" directory; you will cry.
  10. STREAM PROCESSING: a three-way "vs." comparison (the tool logos are not captured in the transcript).
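Slide 9 does not spell out what goes wrong between retention and the cleanup interval. One likely reading is that expired data is removed only when the periodic retention check runs, so it can outlive the configured retention. In Kafka the relevant broker settings are `log.retention.ms` (or `log.retention.hours`) and `log.retention.check.interval.ms`. The sketch below is a simplified model with a hypothetical function name: it ignores the fact that Kafka deletes whole log segments and assumes checks fire at exact multiples of the interval.

```python
def deletion_time_ms(segment_closed_at_ms, retention_ms, check_interval_ms):
    """Time of the first periodic check at or after the moment the segment expires."""
    expires_at = segment_closed_at_ms + retention_ms
    checks_needed = -(-expires_at // check_interval_ms)  # ceiling division
    return checks_needed * check_interval_ms


# Retention of 1000 ms with a check every 300 ms: the data expires at t=1000
# but is only removed by the check that runs at t=1200.
print(deletion_time_ms(0, 1000, 300))
```

When sizing disks, the check interval (and the segment size) therefore has to be budgeted on top of the nominal retention.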
  11. STREAM PROCESSING. The good parts: very simple programming model and APIs; multilanguage support; DataFrame API; ML libraries; wide community; wide range of addons. Points of attention: mini-batch processing, not real stream processing; heavy resource footprint; prone to timeouts or memory errors; hard to fine-tune to get the right performance.
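The "mini-batch processing, not real stream" point on slide 11 comes down to when a result becomes available. The following sketch compares the two models on timestamped events. The function names and the integer timestamps are assumptions for illustration and are not tied to any specific framework's API.

```python
def per_event(events, handle):
    """Record-at-a-time streaming: each result is produced at the event's arrival time."""
    return [(arrival, handle(value)) for arrival, value in events]


def micro_batch(events, handle, interval):
    """Mini-batching: events are buffered and results appear at the end of each interval."""
    results = []
    for arrival, value in events:
        batch_end = (arrival // interval + 1) * interval
        results.append((batch_end, handle(value)))
    return results


events = [(1, 1), (4, 2), (11, 3)]  # (arrival time, payload)


def double(value):
    return value * 2


print(per_event(events, double))
print(micro_batch(events, double, 10))
```

With a batch interval of 10, the event that arrived at time 1 is reported at time 10, so the interval sets a floor on latency regardless of how fast the handler is.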
  12. STREAM PROCESSING. The good parts: stream processing; multilanguage support; works without much configuration effort; low resources configuration; wide community; lots of connectors and addons; great performance, like, "The Flash" great. Points of attention: slightly more complex programming model; some support for other languages.
  13. STREAM PROCESSING. The good parts: stream processing; multilanguage support; simple API (very similar to Spark); Dataset API; ML libraries; good handling of resources; low configuration/optimisation overhead. Buuuuut.....: does not have a wide community; does not have that many connectors and addons.
  14. BATCH PROCESSING: the good parts. Apache Spark: multilanguage support; simple API; DataFrame API; ML libraries; wide community; wide range of addons. Apache Flink: multilanguage support; simple API (very similar); DataSet API; ML libraries.
  15. BATCH PROCESSING: points of attention. Apache Spark: heavy resource footprint; prone to timeouts or memory errors; hard to fine-tune to get the right performance. Apache Flink: fewer configuration problems; better handling of resources; not a big community; not many addons.
  16. DATABASES. Before committing to a database: 01 Think about how you need to access the data; 02 Read 1 again; 03 Seriously, read 1 again. Select a database based on your needs, e.g.: hardcore read/write workload and not much advanced querying: HBase; heavy read/write workload and minimally dynamic querying: Cassandra; advanced text querying and not such a heavy read/write workload: something else.
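The rule of thumb on slide 16 can be written down as a tiny lookup, purely as a memory aid. The function name and the three `querying` labels ("basic", "dynamic", "text") are invented for this sketch; only the three suggestions themselves come from the slide.

```python
def suggest_database(heavy_read_write, querying):
    """Return the slide's suggestion for a workload; querying is 'basic', 'dynamic' or 'text'."""
    if heavy_read_write and querying == "basic":
        return "HBase"
    if heavy_read_write and querying == "dynamic":
        return "Cassandra"
    if not heavy_read_write and querying == "text":
        return "something else"
    # Combinations the slide does not cover: go back to step 01.
    return "undecided: think again about how you need to access the data"


print(suggest_database(True, "basic"))
print(suggest_database(True, "dynamic"))
print(suggest_database(False, "text"))
```

The fallback branch is deliberate: the slide only covers three access patterns, and anything else is a prompt to revisit the access requirements rather than to guess.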
  17. BONUS ROUND: MANAGEMENT. Apache Ambari: provision a Hadoop cluster; manage a Hadoop cluster; monitor a Hadoop cluster. Ambari uses Hadoop ecosystem distributions such as: Hortonworks, Cloudera.
  18. ARCHITECTURE REVISITED (diagram labels): SENSOR, SENSOR, SENSOR; APACHE KAFKA; APACHE STORM; DATA SINK; APACHE HDFS; APACHE SPARK; APACHE HBASE/CASSANDRA; APIs; PORTALS
  19. CLOUD BASED ARCHITECTURES. Pros: less configuration overhead; less maintenance overhead; easily scalable; reliable; return focus back to data and product. Cons: $$$$$$$$$$
  20. CLOUD BASED ARCHITECTURES (diagram labels): SENSOR, SENSOR, SENSOR; GOOGLE PUBSUB; GOOGLE DATAFLOW; DATA SINK; GOOGLE CLOUD STORAGE; GOOGLE DATAPROC; GOOGLE BIGTABLE/BIGQUERY; APIs; PORTALS
  21. CLOUD BASED ARCHITECTURES (diagram labels): SENSOR, SENSOR, SENSOR; AMAZON SIMPLE QUEUE SERVICE; AMAZON DATA PIPELINE; DATA SINK; AMAZON S3; AMAZON ELASTIC MAPREDUCE; AMAZON DYNAMODB/REDSHIFT; APIs; PORTALS
  22. BE READY. BE SAFE. BE SECURE. BinaryEdge AG, Freigutstrasse 40, 8001 Zurich, Switzerland. info@binaryedge.io, www.binaryedge.io, +41 78 713 40 00. CONTINGENCY, THREAT, SAFE, IRRELEVANT
  • saifulmuhajir

    Oct. 30, 2017

