Quick dive into the
Big Data pool
without drowning
Demi Ben-Ari - VP R&D @ Panorays
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder “Big Things” Big Data Community
In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
Interested in almost every kind of technology – A True Geek
Agenda
● Basic Concepts
● Introduction to Big Data frameworks
● Distributed Systems => Problems
● Monitoring
● Conclusions
Say “Distributed”, Say “Big Data”,
Say….
Some basic concepts
What is Big Data (IMHO)?
● Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
What is Big Data (IMHO)
● Some define it by the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
What is Big Data (IMHO)
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
Why Not Relational Data
● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ Real world need for transactional guarantees is limited
What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
What is the NoSQL landscape?
● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of
N typed columns
● Three key factors to help understand the subject
○ Consistency: Get identical results, regardless which node is queried?
○ Availability: Respond to very high read and write volumes?
○ Partition tolerance: Still available when part of it is down?
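The four classes above can be sketched with plain Python structures. This is an illustrative sketch only, not tied to any specific database; all names and values are made up:

```python
# Key-Value: keys map to arbitrary values of any data type
kv_store = {"user:42": b"\x01\x02", "session:7": "opaque-token"}

# Document: sets of JSON-like documents, queryable in whole or in part
documents = [
    {"_id": 1, "name": "Demi", "tags": ["big-data", "devops"]},
    {"_id": 2, "name": "Alice", "tags": ["spark"]},
]
spark_users = [d["name"] for d in documents if "spark" in d["tags"]]

# Wide Column: row keys map to named column families of typed columns
wide_column = {"row-1": {"profile": {"name": "Demi", "age": 30}}}

# Graph: each data element relates to N others via edges
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
```

The query styles differ accordingly: the key-value store only supports lookup by key, while the document store lets you filter on any field inside the documents.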
What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance exist in
a mutually dependent relationship; pick any two.
○ Consistency + Availability: MySQL, PostgreSQL, Greenplum, Vertica (RDBMS), Neo4J (Graph)
○ Availability + Partition tolerance: Cassandra, DynamoDB, Riak, CouchDB, Voldemort (Key-Value / Wide Column)
○ Consistency + Partition tolerance: HBase, MongoDB, Redis, BigTable, BerkeleyDB (Wide Column / Key-Value)
DB Engines - Comparison
● http://db-engines.com/en/ranking
DB Engines - Comparison
What does DevOps really mean?
Development
Software Engineering
UX
Operations
System Admin
Database Admin
What does DevOps really mean?
DevOps
Cross-functional teams
Operators automating systems
Developers operating systems
Introduction to
Big Data
Frameworks
https://d152j5tfobgaot.cloudfront.net/wp-content/uploads/2015/02/yourstory_BigData.jpg
Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex
data at the required speed
● A system to run on a large number of machines that don’t share any
memory or disk
● A system to run on a cluster of machines that can be put together at
relatively low cost and with easier maintenance
Hadoop Principles
● “A system that moves the computation to where the data is”
● Key Concepts of Hadoop
Flexibility Scalability
Low cost
Fault
Tolerant
Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system to store data in smaller
blocks in a fail safe manner
● MapReduce - Programming framework
○ Has the ability to take a query over a dataset, divide it and run it in
parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splitting a MapReduce Job Tracker’s info
■ Resource Manager (Global)
■ Application Master (Per application)
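The MapReduce model above can be sketched in a few lines. This is a minimal single-process sketch for intuition only; in a real Hadoop job the map and reduce calls run on many nodes in parallel, with the shuffle moving data between them:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs for each word in an input split
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Aggregate all the values emitted for a single key
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group intermediate (key, value) pairs by key
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data big"]))  # {'big': 2, 'data': 1}
```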
Hadoop Ecosystem
Hadoop Core
HDFS
MapReduce /
YARN
Hadoop Common
Hadoop Applications
Hive Pig HBase Oozie Zookeeper Sqoop Spark
Hadoop (+Spark) Distributions
Elastic MapReduce DataProc
New Age BI Applications
● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules locally and in distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise
Big Data Analytics
● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand a vast number of data types and data
compressions
● Ability to process data on variety of processing frameworks
● Distributed data processing
○ In-Memory a big plus
● Super fast visualization
○ In-Memory a big plus
When to choose Hadoop?
● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
Distributed Systems => Problems
https://imgflip.com/i/1ap5kr
http://kingofwallpapers.com/otter/otter-004.jpg
Monolith Structure
OS CPU Memory Disk
Processes Java
Application
Server
Database
Web Server
Load
Balancer
Users - Other Applications
Monitoring
System
UI
Many times...all of this was on a single physical server!
Distributed Microservices Architecture
Service A → Queue → DB
Service B → Cache → DB
Service C → Cache → DB
Web Server → DB
Analytics Cluster: Master + Slave, Slave, Slave
Monitoring System???
MongoDB + Spark
Worker 1
Worker 2
….
….
…
…
Worker N
Spark
Cluster
Master
Write
Read
Master
Sharded MongoDB
Replica Set
Cassandra + Spark
Worker 1
Worker 2
….
….
…
…
Worker N
Cassandra
Cluster
Spark
Cluster
Write
Read
Cassandra + Serving
Cassandra
Cluster
Write
Read
UI Client
UI Client
UI Client
UI Client
Web Service
Web Service
Web Service
Web Service
Problems
● Multiple physical servers
● Multiple logical services
● Want Scaling => More Servers
● Even if you had all of the metrics
○ You’ll have an overflow of the data
● Your monitoring becomes a “Big Data” problem itself
This is what “Distributed” really Means
The DevOps Guy
(It might be you)
Monitoring is Crucial
http://memeguy.com/photo/46871/you-are-being-monitored
Monitoring
Operation System
Metrics
Some help
from “the Cloud”
AWS CloudWatch / GCP Stackdriver
Report to Where?
● We chose:
● Graphite (InfluxDB) + Grafana
● Can correlate System and
Application metrics in one
place :)
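Reporting an application metric to Graphite is simple enough to sketch: its plaintext protocol takes one line per datapoint, "metric.path value timestamp\n", usually over TCP port 2003. The host name below is a placeholder for your own Graphite (or InfluxDB with a Graphite-compatible input):

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    # One datapoint in Graphite's plaintext protocol
    timestamp = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite.internal", port=2003):
    # Open a TCP connection and push a single datapoint
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example line, as it goes over the wire:
print(format_metric("app.api.requests.count", 17, 1456000000))
```

In practice you would batch datapoints or use a client library rather than open a connection per metric, but the wire format stays this simple.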
Monitoring
Cassandra
Monitoring Cassandra
● OpsCenter - by DataStax
Monitoring Cassandra
Monitoring Spark
Ways to Monitor Spark
● Grafana-spark-dashboards
○ Blog:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
● Spark UI - Online, while each application is running
● Spark History Server - Offline (After application finishes)
● Spark REST API
○ Querying via inner tools to do ad-hoc monitoring
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
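Querying the Spark REST API for ad-hoc monitoring can look like the sketch below. The JSON endpoints under /api/v1 are served by the Spark UI (port 4040 by default) and by the History Server; the host name and the sample response are illustrative placeholders, not real output:

```python
import json
from urllib.request import urlopen

def list_applications(base_url="http://spark-driver:4040"):
    # Fetch the list of known applications from the Spark REST API
    with urlopen(base_url + "/api/v1/applications") as resp:
        return json.load(resp)

def running_app_ids(apps):
    # Each entry in the response carries an application "id"
    return [app["id"] for app in apps]

# Illustrative (trimmed) shape of a response:
sample = [{"id": "app-20160101-0001", "name": "my-job", "attempts": []}]
print(running_app_ids(sample))  # ['app-20160101-0001']
```

From the application id you can drill further into the API (jobs, stages, executors) to build the ad-hoc checks mentioned above.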
Monitoring
Your Data
https://memegenerator.net/instance/53617544
Data Questions? What should we measure?
● Did all of the computation occur?
○ Are there any data layers missing?
● How much data do we have? (Volume)
● Is all of the data in the Database?
● Data Quality Assurance
Data Answers!
● The method doesn’t really matter, as long as you:
○ Can follow the results over time
○ Know your data flow, know what might fail
○ Make it easy for anyone to add more monitoring
(For the ones that add the new data each time…)
○ Don’t trust others to add monitoring
(It will always end up as the DevOps’s “fault” -> No monitoring will be
applied)
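A hypothetical data-completeness check of the kind described above: compare per-layer record counts against an expected baseline, so a missing data layer shows up before anyone queries it. Layer names and thresholds here are made-up examples:

```python
EXPECTED_MIN_COUNTS = {"raw_events": 1_000_000, "enriched": 900_000}

def missing_layers(actual_counts, expected=EXPECTED_MIN_COUNTS):
    # Return the data layers whose counts fall below the expected baseline
    return sorted(
        layer for layer, minimum in expected.items()
        if actual_counts.get(layer, 0) < minimum
    )

alerts = missing_layers({"raw_events": 1_200_000, "enriched": 10})
print(alerts)  # ['enriched']
```

Feeding the resulting list into your alerting channel gives you the "follow the results over time" property without depending on anyone remembering to check manually.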
Logging?
Monitoring?
ELK - Elasticsearch + Logstash + Kibana
http://www.digitalgov.gov/2014/05/07/analyzing-search-data-in-real-time-to-drive-decisions/
Monitoring Stack
Alerting
Metrics Collection
Datastore
Dashboard
Data Monitoring
Log Monitoring
Big Data - Are we there yet?
● “3 Vs”: - What are the right questions we want to ask?
○ Volume - How much?
■ Can it run on a single machine in reasonable time?
○ Velocity - How fast?
■ Can a single machine handle the throughput?
○ Variety - What kind? (Difference)
■ Is your data not changing and varying?
● If the answer to most of the previous questions is “Yes”,
think again before adding the complexity of “Big Data”
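The Volume and Velocity questions can be answered with back-of-envelope arithmetic. All the numbers below are illustrative assumptions; plug in your own workload and hardware figures:

```python
# Workload (assumed figures for illustration)
dataset_gb = 200            # Volume: total data size
events_per_sec = 5_000      # Velocity: incoming throughput

# What one commodity machine can plausibly handle (assumed)
single_node_disk_gb = 2_000
single_node_events_per_sec = 50_000

fits_on_one_machine = (
    dataset_gb <= single_node_disk_gb
    and events_per_sec <= single_node_events_per_sec
)
print(fits_on_one_machine)  # True -> think again before going "Big Data"
```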
Conclusions
● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet
● Take measures to automate and monitor everything
● Having Clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to the needs
Questions?
https://www.stayathomemum.com.au/wp-content/uploads/2015/01/DDDDDD.jpg
Still feel like you’re
drowning?
● LinkedIn
● Twitter: @demibenari
● Blog:
http://progexc.blogspot.com/
● demi.benari@gmail.com
● “Big Things” Community
Meetup, YouTube, Facebook,
Twitter
● GDG Cloud