This document discusses big data and paradigm shifts. It provides examples of how internet companies like Pinterest, Instagram, and Tumblr have leveraged big data technologies to scale rapidly while maintaining small employee teams. The document also defines big data using the four V's of volume, velocity, variety and variability. Examples are given of how companies have used big data analytics to improve customer experiences and increase business metrics like booking conversions. Technologies discussed include Hadoop, NoSQL databases, and data warehousing appliances.
SystemT: Declarative Information Extraction, by Laura Chiticariu
Invited talk at MIT CSAIL, March 8, 2016
Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service. Such applications drive the following main requirements for IE systems: accuracy, scalability, expressivity, transparency, and customizability.
SystemT, a declarative IE system, has been designed and developed to address these requirements. It is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses a declarative language for expressing NLP algorithms called AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easy to use, maintain and customize.
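To make the declarative flavor concrete, here is a sketch of what an AQL extraction rule looks like. The rule below is an illustrative example written from AQL's documented create view/extract syntax, not a rule from the talk:

```
-- Hypothetical rule: extract US-style phone numbers from each document.
create view PhoneNumber as
  extract regex /\d{3}-\d{3}-\d{4}/
    on D.text as number
  from Document D;
```

Because rules are declarative views rather than imperative code, the SystemT optimizer is free to choose the execution plan, much as a relational optimizer does for SQL.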
SystemT ships today with multiple products across 4 IBM Software Brands. Furthermore, SystemT is used in multiple ongoing research projects and is being taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionalities based on natural language processing, machine learning, and database technology.
Deloitte's report and point of view on IBM's Watson. IBM Watson, AI, and Cognitive Computing are rapidly evolving technologies that can support and enhance enterprise solutions. Learn about IBM Watson: the why and the how.
Under the grid computing paradigm, large sets of heterogeneous resources can be aggregated and shared. Grid development and acceptance hinge on proving that grids reliably support real applications, and on creating adequate benchmarks to quantify this support. However, applications of grids (and clouds) are just beginning to emerge, and traditional benchmarks have yet to prove representative in grid environments. To address this chicken-and-egg problem, we propose a middle-way approach: create and run synthetic grid workloads composed of applications representative of today's grids (and clouds). For this purpose, we have designed and implemented GrenchMark, a framework for synthetic workload generation and submission. The framework greatly facilitates synthetic workload modeling, comes with over 35 synthetic and real applications, and is extensible and flexible. We show how the framework can be used for grid system analysis, functionality testing in grid environments, and for comparing different grid settings, and present the results obtained with GrenchMark in our multi-cluster grid, the DAS.
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat..., by Alexandru Iosup
Data are pouring in, and defining and providing data-processing services at massive scale, in short, Big Data services, could significantly improve the revenue of Europe's Small and Medium Enterprises (SMEs). A paradigm shift is about to occur, one in which data processing becomes a basic life utility, for both SMEs and the European people. Although the burgeoning datacenter industry, of which the Netherlands is a top player in Europe, is promising to enable Big Data services, the architectures and even infrastructure for these services are still lagging behind in performance, efficiency, and sophistication, and are built as monoliths reminding us of traditional data silos. Can we remove the performance and efficiency limitations of the current Big Data ecosystems, that is, of the complex stacks of middleware that are currently in use, for Big Data services? In this talk, I will present several use cases (workloads) of Big Data services for time-stamped [2,3] and graph data [4], evaluate or benchmark the performance of several Big Data stacks [3,4] for these use cases, and present a path (and promising early results) to providing a generic, data-agnostic, non-monolithic Big Data architecture that can efficiently and elastically use datacenter resources via cloud computing interfaces [1,5].
[1] A. L. Varbanescu and A. Iosup. On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC|12 (MTAGS). http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
[2] de Ruiter and Iosup. A workload model for MapReduce. MSc thesis at TU Delft. Jun 2012. Available online via TU Delft Library, http://library.tudelft.nl
[3] Hegeman, Ghit, Capotã, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE Big Data 2013. http://www.pds.ewi.tudelft.nl/~iosup/btworld-mapreduce-workflow13ieeebigdata.pdf
[4] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. IEEE IPDPS 2014. http://www.pds.ewi.tudelft.nl/~iosup/perf-eval-graph-proc14ipdps.pdf
[5] B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS 2014. http://pds.twi.tudelft.nl/~iosup/dynamic-mapreduce14sigmetrics.pdf
Building Resiliency and Agility with Data Virtualization for the New Normal, by Denodo
Watch: https://bit.ly/327z8UM
While the impact of COVID-19 is uniform across organisations in the region, much of how an organisation recovers from the impact and thrives in the market depends on its resiliency and business agility. An organisation's data management strategy holds the key as it tackles the challenges of siloed data sources, optimising for operational stability, and ensuring real-time delivery of consistent and reliable information, irrespective of the data source or format.
Join this session to hear why large organisations are implementing Data Virtualization, a modern data integration approach, in their data architecture to build resiliency, enhance business agility, and save costs.
In this session, you will learn:
- How to deliver clear strategy for agile data delivery across the enterprise without pains of traditional data integration
- How to provide a robust yet simple architecture for data governance, master data, data trust, data privacy, and data access security implementation, all from a single unified framework
- How to deploy digital transformation initiatives for Agile BI, Big Data, Enterprise Data Services & Data Governance
This presentation describes SGI's new offerings for Big Data in the enterprise:
* New SGI InfiniteData Cluster
* New Hadoop public Sandbox
* New SGI ObjectStore
* SGI InfiniteStorage Gateway
Learn more: http://sgi.com
Watch the video presentation: http://wp.me/p3RLEV-1jV
Applications need data, but the legacy approach of n-tiered application architecture doesn’t solve for today’s challenges. Developers aren’t empowered to build and iterate their code quickly without lengthy review processes from other teams. New data sources cannot be quickly adopted into application development cycles, and developers are not able to control their own requirements when it comes to data platforms.
Part of the challenge here is the existing relationship between two groups: developers and DBAs. Developers are trying to go faster, automating build/test/release cycles with CI/CD, and thrive on the autonomy provided by microservices architectures. DBAs are stewards of data protection, governance, and security. Both of these groups are critically important to running data platforms, but many organizations deal with high friction between these teams. As a result, applications get to market more slowly, and it takes longer for customers to see value.
What if we changed the orientation between developers and DBAs? What if developers consumed data products from data teams? In this session, Pivotal’s Dormain Drewitz and Solstice’s Mike Koleno will speak about:
- Product mindset and how balanced teams can reduce internal friction
- Creating data as a product to align with cloud-native application architectures, like microservices and serverless
- Getting started bringing lean principles into your data organization
- Balancing data usability with data protection, governance, and security
Presenter : Dormain Drewitz, Pivotal & Mike Koleno, Solstice
New Opportunities for Connected Data - Emil Eifrem @ GraphConnect Boston + Ch..., by Neo4j
Today’s complex data is not only big, but also semi-structured and densely connected. In this session we’ll look at how size, structure and connectedness have converged to transform the data landscape. We’ll then go on to look at some of the new opportunities for creating end-user value that have emerged in a world of connected data, illustrated with practical examples drawn from the telecommunications, social media and logistics sectors.
Because every organization produces and propagates data as part of their day-to-day operations, data trends are becoming more and more important in the mainstream business world’s consciousness. For many organizations in various industries, though, comprehension of this development begins and ends with buzzwords: “big data,” “NoSQL,” “data scientist,” and so on. Few realize that any and all solutions to their business problems, regardless of platform or relevant technology, rely to a critical extent on the data model supporting them. As such, Data Modeling is not an optional task for an organization’s data effort, but rather a vital activity that facilitates the solutions driving your business. Since quality engineering/architecture work products do not happen accidentally, the more your organization depends on automation, the more important the data models driving the engineering and architecture activities of your organization become. This webinar illustrates Data Modeling as a key activity upon which so much technology depends.
Solution Centric Architectural Presentation - Implementing a Logical Data War..., by Denodo
Watch full webinar here: https://bit.ly/3H5AYZf
Implementing a logical data fabric as an architecture makes absolute sense when you have data spread across various sources in the cloud, including data warehouses, data lakes and even realtime data. In this session our customer will discuss the ways in which they implemented Denodo as a logical data fabric and how it helped them reduce risk and speed up time to access data.
In this webinar, we talk with experts from Integration Developer News about the SnapLogic Elastic Integration Platform and adoption trends for iPaaS in the enterprise.
During the discussion, we address cloud application adoption challenges and 5 signs you need better cloud integration, including struggles with the "Integrator's Dilemma" and segregated integration.
To learn more, visit: www.snaplogic.com/connect-faster
Data architecture is foundational to an information-based operational environment. It is your data architecture that organizes your data assets so they can be leveraged in your business strategy to create real business value. Even though this is important, not all data architectures are used effectively. This webinar describes the use of data architecture as a basic analysis method. Various uses of data architecture to inform, clarify, understand, and resolve aspects of a variety of business problems will be demonstrated. As opposed to showing how to architect data, your presenter Dr. Peter Aiken will show how to use data architecting to solve business problems. The goal is for you to be able to envision a number of uses for data architectures that will raise the perceived utility of this analysis method in the eyes of the business.
Find more Data-Ed webinars here: www.datablueprint.com
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality, by Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Accelerate your Kubernetes clusters with Varnish Caching, by Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024, by Tobias Schneck
As AI technology pushes into IT, I asked myself, as an "infrastructure container Kubernetes guy": how does this fancy AI technology get managed from an infrastructure operations point of view? Is it possible to apply our beloved cloud native principles to it as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply them to our own infrastructure and get them to work from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo..., by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Securing your Kubernetes cluster: a step-by-step guide to success!, by KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova..., by Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
How world-class product teams are winning in the AI era by CEO and Founder, P...
Big data 2012 v1
1. Paradigm Shifts: Big Data
Pini Cohen, VP and Senior Analyst
STKI Summit 2012
Tell me and I'll forget; show me and I may remember; involve me and I'll understand.
2. The “Magic” of internet companies
Source: http://venturebeat.com/2011/10/24/next-hot-internet-companies-not-in-us/internet-company-growth/
Pini Cohen’s work Copyright STKI@2012
Do not remove source or attribution from any slide or graph
3. Pinterest
4. Pinterest Architecture Update: 18 Million Visitors, 10x Growth, 12 Employees, 410 TB of Data
• 80 million objects stored in S3 with 410 terabytes of user data, 10x what they had in August. EC2 instances have grown by 3x. Around $39K for S3 and $30K for EC2 a month.
• Pay for what you use saves money. Most traffic happens in the afternoons and evenings, so they reduce the number of instances at night by 40%.
• 12 employees as of last December. Using the cloud, a site can grow dramatically while maintaining a very small team. Looks like 31 employees as of now.
Source: http://highscalability.com/blog/2012/5/21/pinterest-architecture-update-18-million-visitors-10x-growth.html
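The Pinterest cost figures invite a quick back-of-the-envelope check. In the Python sketch below, only the $30K/month EC2 figure and the 40% night reduction come from the slide; the 50/50 day/night split of hours is an assumption for illustration:

```python
# Back-of-the-envelope check on the EC2 figures quoted in the slide.
EC2_MONTHLY = 30_000       # USD/month, from the slide (after night scaling)
NIGHT_REDUCTION = 0.40     # instances cut by 40% at night, from the slide
NIGHT_FRACTION = 0.5       # assumed share of hours that count as "night"

def scaled_cost(always_on_monthly):
    """Monthly cost when the fleet is reduced at night."""
    day = always_on_monthly * (1 - NIGHT_FRACTION)
    night = always_on_monthly * NIGHT_FRACTION * (1 - NIGHT_REDUCTION)
    return day + night

# Invert the formula: what would an always-on fleet have cost?
always_on = EC2_MONTHLY / (1 - NIGHT_FRACTION * NIGHT_REDUCTION)
print(round(always_on))               # 37500
print(round(scaled_cost(always_on)))  # 30000
```

Under these assumptions, night scaling shaves roughly $7.5K a month off an always-on fleet, which is the "pay for what you use" point the slide is making.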
5. Instagram
• The Instagram philosophy:
• Simplicity
• Optimized for minimal operational burden
• Instrument everything
6. Scaling Instagram
• Instagram went to 30+ million users in less than two years and then rocketed to 40 million users 10 days after the launch of its Android application.
• After the release of the Android app they had 1 million new users in 12 hours.
• 2 engineers in 2010.
• 3 engineers in 2011.
• 5 engineers in 2012, 2.5 on the backend. This includes iPhone and Android development.
Source: http://highscalability.com/blog/2012/4/16/instagram-architecture-update-whats-new-with-instagram.html
7. Tumblr – Microblogging social networking platform
• 500 million page views a day
• 15B+ page views a month
• Peak rate of ~40k requests per second
• 1+ TB/day into Hadoop cluster
• Many TB/day into MySQL/HBase/Redis/Memcache
• Growing at 30% a month
• ~1000 hardware nodes in production (not cloud)
• ~20 engineers (106 employees in total)
Source: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html (STKI modifications)
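The Tumblr traffic figures are internally consistent, which a few lines of Python confirm (all inputs are from the slide; the arithmetic is mine):

```python
# Cross-check the Tumblr traffic figures quoted in the slide.
PAGE_VIEWS_PER_DAY = 500_000_000   # from the slide
PEAK_RPS = 40_000                  # peak requests/second, from the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

avg_rps = PAGE_VIEWS_PER_DAY / SECONDS_PER_DAY
monthly_views = PAGE_VIEWS_PER_DAY * 30

print(round(avg_rps))                # 5787 average page views/second
print(round(PEAK_RPS / avg_rps, 1))  # 6.9 -> peak is ~7x the average
print(monthly_views)                 # 15000000000, matching "15B+ a month"
```

The roughly 7x gap between peak and average rate is why capacity is planned against the ~40k req/s peak rather than the average.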
8. Technology listing
• Hadoop Mapreduce
• NoSQL dbms (Cassandra, Mongo, HBASE)
• Sharding
• In Memory DBMS
• Memcached
• MemSQL
• Solr
• Redis
• DJANGO
• Python
• ELB - Elastic load balancing amazon
9. Paradigm shifts agenda
• Big Data:
• Big Data definition and background
• Big Data value
• Big Data technology
Source: http://www.b2binbound.com/blog/?Tag=paradigm%20shift
10. Big Data Definition – 4 V's (or more…)
• Volume – tens of TBs and more (15-20TB+)
• Velocity – the speed at which data is added (10M items per hour and more), and the speed at which the data needs to be processed
• Variety – different types of data, structured and unstructured. In many cases deals with the internet of things and social media, but also with voice, video, etc.
• Variability – able to cope with new attributes and changing data types without interrupting the analytical process (without "import-export")
• Other optional V's – validity, volatility, viscosity (resistance to flow), etc.
Source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html
11. The origins of the 3 V’s:
• 2001 research by Doug Laney from META Group (now Gartner):
12. “Big Data” theme – main current usage:
• “‘Big Data’ is just marketing jargon.” – Doug Laney, Gartner
Source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html
Source: http://winnbadisa.com/wp-content/uploads/2011/12/marketing-career-cloud.jpg
• STKI : doing something significantly different from
what you’ve done until now
13. Big Data at work:
• Orbitz Worldwide has collected 750 terabytes of unstructured data on their consumers’ behavior – detailed information from customer online visits and browsing sessions. Using Hadoop, models have been developed to improve search results and tailor the user experience based on everything from location and interest in family travel versus solo travel to the kind of device being used to explore travel options.
• The result? To date, a 7% increase in interaction rate, 37% growth in stickiness of sessions, and a net 2.6% increase in booking path engagement.
Source: http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_cons_techtrends2012_013112.pdf
14. DW appliances will be discussed later
• Teradata
• EMC Greenplum
• Oracle Exadata
• Microsoft Parallel Data Warehouse
Source: http://www.asugnews.com/2011/09/06/inside-saps-product-naming-strategies/
15. What is the business value of big data analytics?
• Big data is now a technology looking for a business need
• It can mean doing the same thing but better / faster
(better segmentation, more accurate analysis model)
• Or it can mean doing completely new things (telematics,
sentiment analysis, recommendation engine, matching
competition’s pricing in real time, being able to analyze
data we haven’t been able to analyze in the past)
16. Decision making – old school vs. new school (big data)
• Old School:
• Phase 1 : Analyze existing data and prepare general model
• Phase 2: Apply the general model to specific client
• This means applying the same model for many clients when they
arrive
• Issues with Old School decision making:
• Time gap between preparing and applying the model
• # of combinations might be too big for a general model (example: recommendations based on interests)
• The general model generated is biased towards “main stream”
population
• New School (Big Data):
• Phase 1: Prepare specific model for the client and apply the model
– instantly
17. Big data use cases
• Recommendation engines – match users to one another and provide recommendations based on similar users (examples: LinkedIn – people you may know; Amazon)
• Sentiment Analysis (Macro or individual user)
• Fraud detection – customer behavior, historical and transactional data combined; the same analysis, but more affordable
• Customer Churn
• Social graph analysis – influencers
• Customer experience analysis – combine data from call
center, web, social media etc.
• Improved segmentation – more data (clickstream, call
records) for more accurate analysis
• Improved customer retention
18. Technology: Elements & Concepts
• Storing data for analytics (mainly):
• HDFS – Hadoop Distributed File System
• MapReduce – programming model, mainly for analytics
• Other add-ons: Pig, Hive, JAQL (IBM)
• Storing and retrieving data – DBMS:
• NoSQL DBMS (not only SQL):
• Cassandra
• MongoDB
• CouchDB
• HBase
19. Who Uses Hadoop?
• Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Netflix
• New York Times
• Powerset (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
20. Who Uses Cassandra?
• Facebook
• Digg
• Despegar
• Ooyala
• Imagini
• SimpleGeo
• Rackspace
• Shazam
• SoftwareProjects
21. Big Data technologies (Hadoop etc.) vs. traditional IT
Traditional IT | Big Data
• Centralized storage | Local storage
• Brand redundant servers | Cheap HW white boxes
• Standard infrastructure and virtual servers | Is standardization needed?! (at the HW level); no server virtualization
• Well-established backup and DRP procedures | Why do I need backup? How do I tackle DRP (compute clusters that are stretched over locations)?
• Traditional vendors | Open source solutions
• Mature products and procedures | In a new patch for specific issues it sometimes says “not implemented yet”
• Traditional programming, SQL | A different kind of programming (map-reduce), no joins
Will Big Data infrastructure be part of existing infrastructure, or will it be developed as a new domain?
22. New type of scale:
• Hadoop:
• Up to 4,000 machines in a cluster
• Up to 20 PB in a cluster
• Currently, traditional IT technologies cannot handle this kind of scale.
• This scale comes with a cost!
Source: http://www.techsangam.com/wp-content/uploads/2012/01/i_love_scalability_mug.jpg
23. Brewer's (CAP) Theorem
• It is impossible for a distributed computer system to
simultaneously provide all three of the following
guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (node failures do not prevent survivors from
continuing to operate)
• Partition Tolerance (the system continues to operate in many
partitions and despite arbitrary message loss)
Source: Scalebase STKI modifications
Professor Eric A. Brewer
24. Dealing With CAP
• Drop Consistency
• Welcome to the “Eventually Consistent” term.
• In the end, everything will work out just fine – and hey, sometimes this is a good-enough solution
• When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
• For a given accepted update and a given node, eventually either
the update reaches the node or the node is removed from service
• Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
Source: Scalebase
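The “eventually consistent” behavior described above can be sketched in a few lines of Python. This is a hypothetical toy model (the `Replica` class and `anti_entropy` function are invented for illustration, not any real store’s API): a write lands on one replica first, reads are briefly inconsistent, and a background gossip round brings every node to the same value.

```python
class Replica:
    """One node holding a single (value, timestamp) cell; last-writer-wins."""
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value, ts):
        # Accept the write only if it is newer than what this node holds.
        if ts > self.ts:
            self.value, self.ts = value, ts

def anti_entropy(replicas):
    """One gossip round: every replica adopts the newest value seen anywhere."""
    newest = max(replicas, key=lambda r: r.ts)
    for r in replicas:
        r.write(newest.value, newest.ts)

# Three replicas; a client write initially reaches only one of them.
nodes = [Replica() for _ in range(3)]
nodes[0].write("v1", ts=1.0)
stale_reads = [n.value for n in nodes]        # ['v1', None, None] -- not yet consistent

anti_entropy(nodes)                            # updates propagate in the background
consistent_reads = [n.value for n in nodes]    # ['v1', 'v1', 'v1'] -- converged
```

Once no new writes arrive and gossip has run, all nodes agree – exactly the BASE guarantee described above.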
25. Hadoop
• Apache Hadoop is a software framework that supports
data-intensive distributed applications
• It enables applications to work with thousands of nodes
and petabytes of data.
• Hadoop was inspired by Google's MapReduce and Google
File System (GFS) papers
• Contains (basically):
• HDFS – Hadoop Distributed File System
• MapReduce programming model
26. HDFS – Hadoop Distributed File System
• Parallel
• Distributed on commodity elements
• Throughput over latency
• Reliable and self-healing
• For large scale – typical file is gigabytes to terabytes (for
one file!)
• Applications need a write-once-read-many access
model (mainly analytics)
27. HDFS motivation
• What if you needed to write a program that distributes data on commodity HW (PCs or servers)? You would need to take care of:
• Where the data is located
• How to distribute data between the nodes
• How many times you want to replicate the data
• How to insert, select and update data
• What to do if one node or more fails
• How to add a node or take out a node
• Managing and monitoring the environment
• The Hadoop Distributed File System does all of this for you!
28. HDFS: Hadoop Distributed File System
• Data nodes and Name node
• Client requests meta data about a file from namenode
• Data is served directly from datanode
[Diagram: the HDFS client asks the namenode for metadata – it sends (file name, block id) and gets back (block id, block location) from the file namespace (e.g. /user/css534/input). It then requests (block id, byte range) directly from the datanodes, which serve the block data from their Linux local file systems.]
source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
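The read path in the figure can be mimicked with a small Python sketch. Everything here (the `NAMENODE` and `DATANODES` dictionaries, the `hdfs_read` function) is a hypothetical in-memory stand-in, not the real HDFS client API; it only illustrates that the namenode serves metadata while block data comes straight from the datanodes.

```python
# Hypothetical in-memory model of the HDFS read path described above:
# the namenode holds only metadata; block data is served by datanodes.

NAMENODE = {                       # file name -> ordered list of (block id, datanode)
    "/user/css534/input": [("blk_1", "datanodeA"), ("blk_2", "datanodeB")],
}
DATANODES = {                      # datanode -> {block id -> block contents}
    "datanodeA": {"blk_1": b"Hello "},
    "datanodeB": {"blk_2": b"World"},
}

def hdfs_read(path):
    blocks = NAMENODE[path]                    # 1. ask the namenode for block locations
    data = b""
    for block_id, node in blocks:              # 2. fetch each block directly
        data += DATANODES[node][block_id]      #    from the datanode that holds it
    return data

print(hdfs_read("/user/css534/input"))   # b'Hello World'
```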
29. Datanode Blockreports
File “part-0” will be replicated twice and saved in blocks 1 and 3 (the file is big, so it has to be divided into 2 blocks)
Block 1 is on data nodes A and C
source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
30. HDFS basic limitations
• Namenode is single point of failure
• Write-once model
• Plan to support appending-writes
• A namespace with an extremely large number of files
exceeds Namenode’s capacity to maintain
• Cannot be mounted by existing OS
• Getting data in and out is tedious
• HDFS does not implement / support user quotas / access
permissions
• Data balancing schemes
• No periodic checkpoints
31. MapReduce programming model
• At its most basic: brings the program to the data
• Contains two elements:
• Map: this part of the job is performed asynchronously, in parallel, by each node
• Reduce: gathers the results from the relevant nodes
• In more detail:
• Map: returns (writes to a temp file) a list containing zero or more (k, v) pairs
• Output can have a different key from the input
• Output can have the same key
• Reduce: returns a new list of reduced output from the input
32. MapReduce motivation
• What if you needed to write a program that processes data
that’s on distributed computers?
• You would need to write a distributed program that:
• Finds where the data is located
• Works on each node and then combines the results from each node together
• Decides where (on the local node) and how (in what format) to write the intermediate results
• Detects when the jobs of all participating nodes have concluded and then starts the “aggregation” part
• Decides what to do if a job is stuck (restart the job or turn to another node to perform the same job)
• Hadoop MapReduce is the framework for you!
33. MapReduce example:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
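The pseudocode above can be exercised as ordinary Python. This is a single-process sketch of the two phases – `map_phase` and `reduce_phase` are illustrative names, and there is no actual distribution – run on the same input used in the dataflow slides:

```python
from collections import defaultdict

def map_phase(documents):
    """map(): emit an intermediate (word, 1) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """reduce(): sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
print(reduce_phase(map_phase(docs)))
# {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

In a real Hadoop job the map calls run on different nodes and the framework shuffles the intermediate pairs to the reducers; the logic per phase is the same.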
34. Dataflow in Hadoop
[Diagram: the client submits a “Word Count” job to the master, which schedules map and reduce tasks across the worker nodes. All elements run on standard HW.]
Source: Haifa Labs IBM
35. Dataflow in Hadoop
[Diagram: the input file is stored in HDFS as two blocks – “Hello World Bye World” and “Hello Hadoop Goodbye Hadoop”. Each map task reads one block and emits intermediate counts (Hello 1, World 2, Bye 1; Hello 1, Hadoop 2, Goodbye 1), which flow to the reduce tasks.]
Source: Haifa Labs IBM
36. Dataflow in Hadoop
[Diagram: each map task writes its intermediate results to the local file system and reports “finished” plus the output location to the master, which passes the locations on to the reduce tasks.]
Source: Haifa Labs IBM
37. Dataflow in Hadoop
[Diagram: the reduce tasks fetch the intermediate map output from the mappers’ local file systems over HTTP GET.]
Source: Haifa Labs IBM
38. Dataflow in Hadoop
[Diagram: the reduce tasks write the final answer to HDFS: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2.]
Source: Haifa Labs IBM
39. Components of a cluster node (top to bottom):
• Flow file input processor
• Flow analysis map/reduce programs (flow-tools)
• MapReduce library
• Hadoop (HDFS, MapReduce)
• Java Virtual Machine
• Operating System (Linux)
• Hardware (CPU, HDD, Memory, NIC)
Source: www.caida.org/workshops/.../wide-casfi1004_wkang.ppt
40. Hive: MapReduce helper
• Code examples:
• hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
• hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
• hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
• hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
• hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
• hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
• hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
41. NoSQL DBMS: storing and retrieving data
• Key/Value
• A big hash table
• Examples: Voldemort, Amazon’s Dynamo
• Big Table
• Big table, column families
• Examples: HBase, Cassandra
• Document based
• Collections of collections
• Examples: CouchDB, MongoDB
• Graph databases
• Based on graph theory
• Examples: Neo4J
• Each solves a different problem
Source: Scalebase
42. Pros/Cons
• Pros:
• Performance
• BigData
• Most solutions are open source
• Data is replicated to nodes and is therefore fault-tolerant
(partitioning)
• Don't require a schema
• Can scale up and down
• Cons:
• Code change
• No framework support
• Not ACID
• Ecosystem (BI, backup)
• There is always a database at the backend
• Some APIs are just too simple
Source: Scalebase
43. Apache Cassandra
• Cassandra is a highly scalable, eventually
consistent, distributed, structured key-value
store
• Child of Google’s BigTable and Amazon’s
Dynamo
• Peer-to-peer architecture. All nodes are equal
Source: ids.snu.ac.kr/w/images/1/18/2011SS-03.ppt
• Cassandra’s replication factor (RF) is the total
number of nodes onto which the data will be
placed. RF of at least 2 is highly recommended,
keeping in mind that your effective number of
nodes is (N total nodes / RF).
• CQL (Cassandra Query Language) command line
• Time stamp for each value written
44. Consistent Hashing
• Partitioning uses consistent hashing (to pick the first node where data is placed), based on MD5 – a distributed hash table algorithm
• Keys hash to a point on a fixed circular space
• The ring is partitioned into a set of ordered slots; servers and keys are hashed over these slots
• Nodes take positions on the circle
• Example: nodes A, B and D exist. B is responsible for the AB range (for replication factor = 2, the default), D for the BD range, and A for the DA range
• When C joins, B and D split their ranges, and C takes over the BC range from D
Source: http://www.intertech.com/resource/usergroup/NoSQL.ppt
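The ring behavior described above can be sketched in Python using MD5, as on the slide. The `Ring` class and its single-point-per-node layout are a simplification invented for illustration (real Cassandra assigns tokens and replicas differently); the point is that when C joins, only keys in the BC range change owner.

```python
import bisect
import hashlib

def md5_point(key):
    """Hash a key to a point on the ring, as the slide does with MD5."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node takes one position on the circle, ordered by hash point.
        self.points = sorted((md5_point(n), n) for n in nodes)

    def owner(self, key):
        """The first node clockwise from the key's point owns the key."""
        i = bisect.bisect(self.points, (md5_point(key),))
        return self.points[i % len(self.points)][1]   # wrap around the circle

keys = ["row1", "row2", "row3", "row4"]
ring = Ring(["A", "B", "D"])
before = {k: ring.owner(k) for k in keys}

ring = Ring(["A", "B", "C", "D"])    # C joins the ring
after = {k: ring.owner(k) for k in keys}
# Only keys that fall in C's new range move; every other key keeps its owner.
```

Adding a node therefore reshuffles only one arc of the circle instead of rehashing every key – the property that lets Cassandra grow the cluster incrementally.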
45. Cassandra’s tunable consistency (write)
Level – Behavior:
• ANY – Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
• ONE – Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
• TWO – Ensure that the write has been written to at least 2 replicas before responding to the client.
• THREE – Ensure that the write has been written to at least 3 replicas before responding to the client.
• QUORUM – Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
• LOCAL_QUORUM – Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
• EACH_QUORUM – Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
• ALL – Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
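The quorum arithmetic in the table (N / 2 + 1, with integer division) is easy to check in code; this tiny helper is illustrative only:

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a QUORUM write: N / 2 + 1 (integer division)."""
    return replication_factor // 2 + 1

# With RF=3, writing at QUORUM and reading at QUORUM overlap in at least one
# replica (2 + 2 > 3), which is what makes reads see the latest write.
for rf in (1, 2, 3, 5):
    print(rf, quorum(rf))   # 1->1, 2->2, 3->2, 5->3
```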
Source: wiki
46. Cassandra’s data model structure
• Think of Cassandra as row-oriented
• A keyspace (settings, e.g. partitioner) contains column families (settings, e.g. comparator, type [Std]); each column is a (name, value, clock) triple
Source: http://assets.en.oreilly.com/1/event/51/Scaling%20Web%20Applications%20with%20Cassandra%20Presentation.ppt
47. Data Model – “flexible” schema!
ColumnFamily: Rockets
• Key 1: name = Rocket-Powered Roller Skates, toon = Ready, Set, Zoom, inventoryQty = 5, brakes = false
• Key 2: name = Little Giant Do-It-Yourself Rocket-Sled Kit, toon = Beep Prepared, inventoryQty = 4, brakes = false
• Key 3: name = Acme Jet Propelled Unicycle, toon = Hot Rod and Reel, inventoryQty = 1, wheels = 1
Source: http://wenku.baidu.com/view/6e254321482fb4daa58d4b87.html
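The “flexible schema” above maps naturally onto nested dictionaries; this hypothetical Python sketch shows that rows in one column family need not share the same columns:

```python
# Hypothetical sketch: each row in the "Rockets" column family is a key plus
# its own set of columns -- rows are not required to share the same columns.
rockets = {
    1: {"name": "Rocket-Powered Roller Skates",
        "toon": "Ready, Set, Zoom", "inventoryQty": 5, "brakes": False},
    2: {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared", "inventoryQty": 4, "brakes": False},
    3: {"name": "Acme Jet Propelled Unicycle",
        "toon": "Hot Rod and Reel", "inventoryQty": 1, "wheels": 1},  # no 'brakes'
}

# Rows 1 and 2 carry a 'brakes' column; row 3 carries 'wheels' instead.
print(sorted(rockets[1]) == sorted(rockets[3]))  # False
```

Adding a new attribute to one row needs no schema migration – exactly the “without import-export” variability described in the 4 V’s slide.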
48. Cassandra’s CQL – Cassandra Query Language
• SQL-like. Examples:
• CREATE KEYSPACE test with strategy_class = 'SimpleStrategy' and strategy_options:replication_factor=1;
• CREATE INDEX ON users (birth_date);
• SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
• However:
• No joins
• No UPDATEs/DELETEs
49. NoSQL benchmark – for scale!
Source: research.yahoo.com/files/ycsb-v4.pdf
50. Can we live with NoSQL limitations?
• Facebook has dropped Cassandra
• “..we found Cassandra's eventual consistency model to be a
difficult pattern to reconcile for our new Messages
infrastructure”
• Facebook has selected HBase (a columnar DBMS).
http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
51. What about other NoSQL DBMS?
• MongoDB
• HBase
• CouchDB
• Maybe next session….
52. Big Data potential implications on IT
• Will traditional RDBMS be obsolete? Surely not!
• Several areas are Big Data territory by definition – internet marketing, cyber, DW, etc.
• How well can we live with “eventually consistent”, which in most cases means a 1-2 minute delay?!
• Can we decide that all batch data can live well on Big Data technologies?
• Will we see, in the end (10 years from now), that only a small portion of data still resides on RDBMS and most of the data resides on Big Data technologies?!
53. Big data challenges
• NLP in Hebrew (entity recognition is more difficult)
• Adapting analytical algorithms to match the big data world (anomaly detection needs to be redefined)
• Some problems with consistency
• Skills problem – BI staff need to program in Java and need Hadoop and NoSQL knowledge
54. Example of big data technology: SPLUNK
• Splunk is a traditional IT vendor whose product has been based on MapReduce (since 2009)
55. Thanks for your patience – hope you enjoyed it
Here you can find the latest version of this presentation http://www.slideshare.net/pini