Quick dive into the
Big Data pool
without drowning
Demi Ben-Ari - VP R&D @ Panorays
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder “Big Things” Big Data Community
In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
Interested in almost every kind of technology – A True Geek
Agenda
● Basic Concepts
● Introduction to Big Data frameworks
● Distributed Systems => Problems
● Monitoring
● Conclusions
Say “Distributed”, Say “Big Data”,
Say….
Some basic concepts
What is Big Data (IMHO)?
● Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
What is Big Data (IMHO)
● Some define it by the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
What is Big Data (IMHO)
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
Why Not Relational Data
● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ Real world need for transactional guarantees is limited
What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
What is the NoSQL landscape?
● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of
N typed columns
● Three key factors to help understand the subject
○ Consistency: Get identical results, regardless which node is queried?
○ Availability: Respond to very high read and write volumes?
○ Partition tolerance: Still available when part of it is down?
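The four classes above can be sketched with plain Python structures. This is an illustrative sketch only, not tied to any specific database; all names and values are made up:

```python
# Key-Value: keys map to arbitrary values of any data type
kv_store = {"user:42": b"\x01\x02", "session:7": "opaque-token"}

# Document: sets of JSON-like documents, queryable in whole or in part
documents = [
    {"_id": 1, "name": "Demi", "tags": ["big-data", "devops"]},
    {"_id": 2, "name": "Alice", "tags": ["spark"]},
]
spark_users = [d["name"] for d in documents if "spark" in d["tags"]]

# Wide Column: row keys map to named column families of typed columns
wide_column = {"row-1": {"profile": {"name": "Demi", "age": 30}}}

# Graph: each data element relates to N others via edges
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
```

The query styles differ accordingly: the key-value store only supports lookup by key, while the document store lets you filter on any field inside the documents.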
What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance exist in
a mutually dependent relationship; pick any two.
○ Consistency + Availability: MySQL, PostgreSQL, Greenplum, Vertica (RDBMS), Neo4J (Graph)
○ Availability + Partition tolerance: Cassandra, DynamoDB, Riak, CouchDB, Voldemort (Key-Value / Wide Column)
○ Consistency + Partition tolerance: HBase, MongoDB, Redis, BigTable, BerkeleyDB (Wide Column / Key-Value)
DB Engines - Comparison
● http://db-engines.com/en/ranking
DB Engines - Comparison
What does DevOps really mean?
Development
Software Engineering
UX
Operations
System Admin
Database Admin
What does DevOps really mean?
DevOps
Cross-functional teams
Operators automating systems
Developers operating systems
Introduction to
Big Data
Frameworks
https://d152j5tfobgaot.cloudfront.net/wp-content/uploads/2015/02/yourstory_BigData.jpg
Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex
data at the required speed
● A system to run on a large number of machines that don’t share any
memory or disk
● A system to run on a cluster of machines that can be put together at
relatively low cost and with easier maintenance
Hadoop Principles
● “A system that moves the computation to where the data is”
● Key Concepts of Hadoop
Flexibility Scalability
Low cost
Fault
Tolerant
Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system to store data in smaller
blocks in a fail safe manner
● MapReduce - Programming framework
○ Has the ability to take a query over a dataset, divide it and run it in
parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splitting a MapReduce Job Tracker’s info
■ Resource Manager (Global)
■ Application Master (Per application)
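The MapReduce model above can be sketched in a few lines. This is a minimal single-process sketch for intuition only; in a real Hadoop job the map and reduce calls run on many nodes in parallel, with the shuffle moving data between them:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs for each word in an input split
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Aggregate all the values emitted for a single key
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group intermediate (key, value) pairs by key
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data big"]))  # {'big': 2, 'data': 1}
```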
Hadoop Ecosystem
Hadoop Core
HDFS
MapReduce /
YARN
Hadoop Common
Hadoop Applications
Hive Pig HBase Oozie Zookeeper Sqoop Spark
Hadoop (+Spark) Distributions
Elastic MapReduce DataProc
New Age BI Applications
● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules locally and in distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise
Big Data Analytics
● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand a vast number of data types and data
compressions
● Ability to process data on variety of processing frameworks
● Distributed data processing
○ In-Memory a big plus
● Super fast visualization
○ In-Memory a big plus
When to choose Hadoop?
● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
Distributed Systems => Problems
https://imgflip.com/i/1ap5kr
http://kingofwallpapers.com/otter/otter-004.jpg
Monolith Structure
OS CPU Memory Disk
Processes Java
Application
Server
Database
Web Server
Load
Balancer
Users - Other Applications
Monitoring
System
UI
Many times...all of this was on a single physical server!
Distributed Microservices Architecture
Service A → Queue → DB
Service B → Cache → DB
Service C → Cache → DB
Web Server → DB
Analytics Cluster: Master + Slave, Slave, Slave
Monitoring System???
MongoDB + Spark
Worker 1
Worker 2
….
….
…
…
Worker N
Spark
Cluster
Master
Write
Read
Master
Sharded MongoDB
Replica Set
Cassandra + Spark
Worker 1
Worker 2
….
….
…
…
Worker N
Cassandra
Cluster
Spark
Cluster
Write
Read
Cassandra + Serving
Cassandra
Cluster
Write
Read
UI Client
UI Client
UI Client
UI Client
Web Service
Web Service
Web Service
Web Service
Problems
● Multiple physical servers
● Multiple logical services
● Want Scaling => More Servers
● Even if you had all of the metrics
○ You’ll have an overflow of the data
● Your monitoring becomes a “Big Data” problem itself
This is what “Distributed” really Means
The DevOps Guy
(It might be you)
Monitoring is Crucial
http://memeguy.com/photo/46871/you-are-being-monitored
Monitoring
Operation System
Metrics
Some help
from “the Cloud”
AWS CloudWatch / GCP Stackdriver
Report to Where?
● We chose:
● Graphite (InfluxDB) + Grafana
● Can correlate System and
Application metrics in one
place :)
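Reporting an application metric to Graphite is simple enough to sketch: its plaintext protocol takes one line per datapoint, "metric.path value timestamp\n", usually over TCP port 2003. The host name below is a placeholder for your own Graphite (or InfluxDB with a Graphite-compatible input):

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    # One datapoint in Graphite's plaintext protocol
    timestamp = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite.internal", port=2003):
    # Open a TCP connection and push a single datapoint
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example line, as it goes over the wire:
print(format_metric("app.api.requests.count", 17, 1456000000))
```

In practice you would batch datapoints or use a client library rather than open a connection per metric, but the wire format stays this simple.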
Monitoring
Cassandra
Monitoring Cassandra
● OpsCenter - by DataStax
Monitoring Cassandra
Monitoring Spark
Ways to Monitor Spark
● Grafana-spark-dashboards
○ Blog:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
● Spark UI - Online, while each application is running
● Spark History Server - Offline (After application finishes)
● Spark REST API
○ Querying via inner tools to do ad-hoc monitoring
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
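Querying the Spark REST API for ad-hoc monitoring can look like the sketch below. The JSON endpoints under /api/v1 are served by the Spark UI (port 4040 by default) and by the History Server; the host name and the sample response are illustrative placeholders, not real output:

```python
import json
from urllib.request import urlopen

def list_applications(base_url="http://spark-driver:4040"):
    # Fetch the list of known applications from the Spark REST API
    with urlopen(base_url + "/api/v1/applications") as resp:
        return json.load(resp)

def running_app_ids(apps):
    # Each entry in the response carries an application "id"
    return [app["id"] for app in apps]

# Illustrative (trimmed) shape of a response:
sample = [{"id": "app-20160101-0001", "name": "my-job", "attempts": []}]
print(running_app_ids(sample))  # ['app-20160101-0001']
```

From the application id you can drill further into the API (jobs, stages, executors) to build the ad-hoc checks mentioned above.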
Monitoring
Your Data
https://memegenerator.net/instance/53617544
Data Questions? What should we measure?
● Did all of the computation occur?
○ Are there any data layers missing?
● How much data do we have? (Volume)
● Is all of the data in the Database?
● Data Quality Assurance
Data Answers!
● The method doesn’t really matter, as long as you:
○ Can follow the results over time
○ Know your data flow, know what might fail
○ Make it easy for anyone to add more monitoring
(For the ones that add the new data each time…)
○ Don’t trust others to add monitoring
(It will always end up as the DevOps’s “fault” -> No monitoring will be
applied)
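A hypothetical data-completeness check of the kind described above: compare per-layer record counts against an expected baseline, so a missing data layer shows up before anyone queries it. Layer names and thresholds here are made-up examples:

```python
EXPECTED_MIN_COUNTS = {"raw_events": 1_000_000, "enriched": 900_000}

def missing_layers(actual_counts, expected=EXPECTED_MIN_COUNTS):
    # Return the data layers whose counts fall below the expected baseline
    return sorted(
        layer for layer, minimum in expected.items()
        if actual_counts.get(layer, 0) < minimum
    )

alerts = missing_layers({"raw_events": 1_200_000, "enriched": 10})
print(alerts)  # ['enriched']
```

Feeding the resulting list into your alerting channel gives you the "follow the results over time" property without depending on anyone remembering to check manually.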
Logging?
Monitoring?
ELK - Elasticsearch + Logstash + Kibana
http://www.digitalgov.gov/2014/05/07/analyzing-search-data-in-real-time-to-drive-decisions/
Monitoring Stack
Alerting
Metrics Collection
Datastore
Dashboard
Data Monitoring
Log Monitoring
Big Data - Are we there yet?
● “3 Vs”: - What are the right questions we want to ask?
○ Volume - How much?
■ Can it run on a single machine in reasonable time?
○ Velocity - How fast?
■ Can a single machine handle the throughput?
○ Variety - What kind? (Difference)
■ Is your data not changing and varying?
● If the answer to most of the previous questions is “Yes”,
think again before adding the complexity of “Big Data”
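The Volume and Velocity questions can be answered with back-of-envelope arithmetic. All the numbers below are illustrative assumptions; plug in your own workload and hardware figures:

```python
# Workload (assumed figures for illustration)
dataset_gb = 200            # Volume: total data size
events_per_sec = 5_000      # Velocity: incoming throughput

# What one commodity machine can plausibly handle (assumed)
single_node_disk_gb = 2_000
single_node_events_per_sec = 50_000

fits_on_one_machine = (
    dataset_gb <= single_node_disk_gb
    and events_per_sec <= single_node_events_per_sec
)
print(fits_on_one_machine)  # True -> think again before going "Big Data"
```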
Conclusions
● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet
● Take measures to automate and monitor everything
● Having Clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to the needs
Questions?
https://www.stayathomemum.com.au/wp-content/uploads/2015/01/DDDDDD.jpg
Still feel like you’re
drowning?
● LinkedIn
● Twitter: @demibenari
● Blog:
http://progexc.blogspot.com/
● demi.benari@gmail.com
● “Big Things” Community
Meetup, YouTube, Facebook,
Twitter
● GDG Cloud