SELA DEVELOPER PRACTICE
December 15-19, 2013

Manu Cohen-Yashar

The Cloud, Big Data and
NoSQL

© Copyright SELA software & Education Labs Ltd. | 14-18 Baruch Hirsch St Bnei Brak, 51202 Israel | www.selagroup.com
Agenda
What is the cloud
Data boom
No SQL
Big Data
Cloud Distributions
What’s next
Make sense of: Cloud, Big Data and NoSQL
How they fit together

Make money !!!
What is the cloud
Cloud Computing is an Idea …
Infrastructure is provisioned by a cloud
provider.
Automatic Scale.
Elasticity. Pay as you use.
Availability.
Simple, Automatic, Economic.
Types of Clouds
IAAS
PAAS
SAAS
and more…
Identity As A Service
Connectivity As A Service

Storage As A Service
Lots of Data
Data doubles every 18 months
Pictures
Web site
emails
Sensors
Geo Information
Financial Information
Science
Art
. . . (Infinite list)
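A rough worked example of the doubling claim above: if volume doubles every 18 months, it grows by a factor of 2^(months/18). The function name and the 10 TB starting point are illustrative assumptions, not figures from the talk.

```python
def projected_volume(initial_tb: float, months: float,
                     doubling_months: float = 18.0) -> float:
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


# Hypothetical starting point: 10 TB today.
for years in (1.5, 3, 6):
    print(f"{years} years: {projected_volume(10, years * 12):.0f} TB")
```

Under this assumption, six years means four doublings, so storage needs grow sixteenfold.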
No Limits
With the cloud it is now possible to mount any
size of cluster and conduct any computation at
any scale.
Whoever makes sense of all the available
data will rule the world.

The conclusion:
Use the cloud to analyze data at large scale.
Let's Talk about Data
When we think of data we think of …
Data has many forms
Yet data comes in many forms and shapes
Graphs
Time Series
Documents
Blobs
Geo
Sensors
Structured
Unstructured
Web
Non-Relational
Not all types of data fit well into the relational
world.
Not all data use cases fit well into the ACID
convention
The relational model does not scale very well
Difficult to distribute
Difficult to replicate
The CAP Theorem
During a network partition, a distributed system must choose
either Consistency or Availability.

(Diagram labels: Sharded NoSQL, RDBMS, Replicated NoSQL)
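The choice stated above can be sketched with a toy two-replica store. This is a minimal illustration, not how any real database is implemented: the class name, the "CP"/"AP" mode strings and the `partitioned` flag are all assumptions made for the example, and a real CP system would typically keep serving the majority side rather than refuse everything.

```python
class TwoReplicaStore:
    """Toy store with two replicas; mode is "CP" or "AP" (illustrative)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False

    def _check_available(self):
        # A CP store refuses requests it cannot keep consistent.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")

    def write(self, side, key, value):
        self._check_available()
        self.replicas[side][key] = value
        if not self.partitioned:
            # Replication only succeeds while the replicas can talk.
            other = "b" if side == "a" else "a"
            self.replicas[other][key] = value

    def read(self, side, key):
        self._check_available()
        return self.replicas[side].get(key)


ap = TwoReplicaStore("AP")
ap.write("a", "x", 1)
ap.partitioned = True
ap.write("a", "x", 2)          # only replica "a" sees this write
print("AP read from b:", ap.read("b", "x"))  # answers, but with stale data

cp = TwoReplicaStore("CP")
cp.partitioned = True
try:
    cp.read("b", "x")
except RuntimeError as err:
    print("CP read from b:", err)  # consistent, but not available
```

The AP store stays available and returns the value from before the partition; the CP store gives up availability instead of risking a stale answer.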
NO SQL
Large family of databases
No Schema
No relations enforced
Designed for high scale and distribution

Types of NO SQL DB
Key Value
Wide Columns
Documents
Graph
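To make the four families more concrete, here is one hypothetical user record shaped the way each family would typically hold it. Plain Python structures stand in for the databases; the keys, field names and the `followers_of` helper are invented for illustration.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": json.dumps({"name": "Dana", "city": "Haifa"})}

# Wide column: row key -> column family -> columns.
# Different rows may carry different columns.
wide_column = {
    "user:42": {
        "profile": {"name": "Dana", "city": "Haifa"},
        "activity": {"last_login": "2013-12-15"},
    },
}

# Document: a nested, self-describing record that can be
# queried by its inner fields.
document = {
    "_id": 42,
    "name": "Dana",
    "address": {"city": "Haifa"},
    "tags": ["admin"],
}

# Graph: nodes plus typed edges between them.
nodes = {42: {"name": "Dana"}, 7: {"name": "Avi"}}
edges = [(42, "FOLLOWS", 7)]


def followers_of(node_id):
    """Return the ids of nodes that have a FOLLOWS edge to node_id."""
    return [src for src, rel, dst in edges
            if rel == "FOLLOWS" and dst == node_id]


print(json.loads(key_value["user:42"])["city"])
print(wide_column["user:42"]["profile"]["city"])
print(document["address"]["city"])
print(followers_of(7))
```

The same fact (Dana lives in Haifa) is reachable in the first three shapes, but only the graph shape makes the relationship between users a first-class thing to query.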
Motivation for NO SQL
Large Scale and Distribution
Simplicity
Low cost
Good fit with the data model
Volume, Velocity and Variety
Important

There is no one NO SQL solution for all
use cases
There are more than 150 possible offerings…
The Cloud and NO SQL
All Cloud Providers have NO SQL solutions
Azure Tables
Google Bigtable
Amazon DynamoDB

NO SQL Databases are deployed on a cluster
There are a large number of cloud hosting offerings for
NO SQL clusters
MongoHQ (MongoDB)
Cassandra on Google Compute Engine
Many more
Example – Mongo in Azure
Big Data
What is Big?
“Big” cannot fit on a single machine.

Conclusion:
Big data has to be distributed.
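One common way to distribute data is to hash each record's key and use the result to pick a machine (a shard). The sketch below assumes simple modulo hashing; the function name and the sample keys are illustrative, and real systems often use consistent hashing or range partitioning instead.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards)."""
    # A stable hash is used on purpose: Python's built-in hash() of a
    # string changes between processes, so nodes would disagree.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical keys spread over a 4-node cluster.
keys = [f"user:{i}" for i in range(1000)]
print(Counter(shard_for(k, 4) for k in keys))
```

Every node can compute the same shard for the same key without coordination, which is what lets reads and writes go straight to the machine that holds the data.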
Types of Big Data Processing
Query
General Analysis
Classification
Recommendation
Clustering
Auditing and monitoring
More…
Challenges
Develop a parallel algorithm
Reduce network traffic -> bring compute to
data
Monitor and manage a large number of parallel
tasks
Survive failures
Performance
Linear scale
Batch Processing vs. Operational Intelligence
Batch Processing
Work on existing data
Provide results within minutes

Operational Intelligence
Work on a stream of data
Provide real-time results
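The difference between the two styles can be sketched on a single metric, an average. The batch version needs the whole data set before it answers; the streaming version updates its answer with every new value. Function and class names are illustrative.

```python
def batch_average(readings):
    """Batch style: all data already exists; compute once over all of it."""
    return sum(readings) / len(readings)


class RunningAverage:
    """Streaming style: keep small state, update the result per event."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


readings = [10, 20, 30]
print("batch:", batch_average(readings))

stream = RunningAverage()
for value in readings:
    print("after", value, "->", stream.update(value))
```

Both end at the same number, but the streaming version had a usable answer after the first event and never needed to store the full history.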
Distributed File System
No single server can store Big Data files
Distribute files across cluster
Failure is part of the game
Similar API to traditional File Systems
Examples:
HDFS
GFS
Cassandra FS
Mongo FS
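The core idea of the points above (split a file into blocks, spread copies across the cluster, tolerate failures) can be sketched as follows. This is a simplified assumption-laden model: the round-robin placement, the tiny block size and the node names are illustrative, and real systems such as HDFS use much larger blocks and rack-aware placement.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def all_blocks_readable(placement: dict, failed: set) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(any(node not in failed for node in holders)
               for holders in placement.values())


blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks), "blocks; block 0 lives on", placement[0])
print("survives losing node2:", all_blocks_readable(placement, {"node2"}))
```

With three replicas per block, any two nodes can fail and every block is still readable, which is what "failure is part of the game" amounts to in practice.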
Hadoop
Big Data Analysis Platform
Batch Processing
Brings Compute tasks to data nodes
Parallel Processing using Map-Reduce
Open Source
Huge ecosystem
Hadoop Ecosystem
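The Map-Reduce model mentioned above can be illustrated with the classic word count, written here in plain Python on one machine. This is a sketch of the programming model only, not Hadoop's API: in a real cluster the map calls run in parallel on the nodes that hold the data, and the framework performs the shuffle. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


print(word_count(["big data", "Big cloud"]))
```

Because each map call sees only its own line and each reduce call sees only its own key, both phases can be spread over many machines without the tasks talking to each other.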
Writing a valuable Map-Reduce job for Hadoop
is not simple
Many open source projects provide
abstractions
Pig
Hive
HBase
Sqoop
Mahout
ZooKeeper
More
Hadoop on the Cloud
Hadoop runs on a cluster
You can use a cluster as a service on major
cloud offerings
Storm
Real-Time big data analytics
Process streams of data
Can be used with any programming language
Wide integration with data sources
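Storm describes a computation as a topology of spouts (stream sources) and bolts (processing steps). The sketch below imitates that shape with Python generators to show how results flow out per event; it does not use Storm's API, and the function names are illustrative.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout stand-in: emits items from some source, one at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt stand-in: turns each sentence into a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt stand-in: emits an updated count as soon as a word arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


topology = count_bolt(split_bolt(sentence_spout(["a b", "a"])))
for word, count in topology:
    print(word, count)
```

Unlike a batch job, nothing here waits for the input to end: each stage passes its output downstream immediately, so counts are current after every event.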
Check your schema
Be open to using NO SQL data stores
Identify your use-case and find the right
database for you
Create a simple POC
Look for Big Data
Ask yourself: What can I gain from big data?
How can the new data or analysis scope enhance
your existing set of capabilities?
What additional opportunities for intervention or
process optimisation does it present?

Identify your use case and find the right product
and data model.
Look for web distributions and create a simple
POC
Questions

Big data, Cloud Computing and No SQL


Editor's Notes

  • #13
    Consistency: A read sees all previously completed writes.
    Availability: Reads and writes always succeed.
    Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.
    https://foundationdb.com/white-papers/the-cap-theorem/
    The basic idea is that if a client writes to one side of a partition, any reads that go to the other side of that partition can't possibly know about the most recent write. Now you're faced with a choice: do you respond to the reads with potentially stale information, or do you wait (potentially forever) to hear from the other side of the partition and compromise availability?