All this data will end up in some company's IT system, and
that company will make money from it
“Big data is the new oil”
It’s not only about data: there will be new uses, new
services… and new competitors!
Sooner or later, every company will face the problems the
web giants had to face
They measured everything:
- Power efficiency of all hardware parts
- Performance-to-power ratio, $ per transaction, etc.
- Cost models of failures
For them: commodity hardware is 3 to 12 times cheaper
Start designing datacenters based only on commodity hardware
Start designing applications distributed across thousands of non
Small is beautiful, but…
Web giants are the champions of infrastructure automation; that’s
why they became champions of the cloud
Application resilience has to be completely redefined, since the
hardware is unreliable and fails constantly.
Having to deploy on many machines changes everything : you
need to automate things
Resilience must be handled by software. Especially for
- « Not Only SQL »
- To go beyond RDBMS limitations
Google : BigTable
Amazon : DynamoDB
Facebook : Cassandra, sharded key-value MySQL
LinkedIn : Voldemort
The need for speed
… and availability
100 ms of degradation of latency: -1% of revenues
More than 500 ms in page load: -20% of page views
More than 400 ms in page load: +5 to 9% of bounce
More than 1 s in page load: -2.8% in ad revenues
Amazon, 1 min of unavailability: 50 K$ of revenue loss
(The blink of an eye is 300 ms)
Les géants du Web
New storage architectures and the CAP theorem
« Availability »
Users can access the system
(read or write)
A is also related to response time.
The more consistency you require,
the worse the latency will be
Large websites use
only two !
« Consistency »
All users have the same
version of information
« Partition tolerance »
The system continues to work in case
of a network partition, i.e. when some
nodes cannot communicate with each other
A radically different approach to databases
Distributed storage, tolerating failure by replicating data
Consistency constraint is relaxed : eventual consistency
Focus is put on availability and low response times (low latency)
Linear horizontal scalability
Variety of data models
- column-oriented
Different sharding approaches
- BigTable, with the distributed file system GFS
- The famous Dynamo paper: a key/value store organised as a replication
ring with consistent hashing, and original approach to
- Cassandra, inspired by both BigTable and Dynamo
- also: a specific design of sharded MySQL used as a key/value store
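The ring of replication with consistent hashing mentioned above can be sketched in a few lines. This is a toy illustration, not the code of Dynamo or Cassandra: the class name `HashRing`, the `vnodes` and `replicas` parameters and the use of SHA-256 are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (SHA-256 read as an integer)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes and N-way replication."""

    def __init__(self, nodes, vnodes=64, replicas=3):
        self.vnodes = vnodes        # ring positions per physical node
        self.replicas = replicas    # copies kept for each key
        self._points = []           # sorted ring positions
        self._owner = {}            # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            point = ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def preference_list(self, key):
        """Walk clockwise from the key and collect distinct physical nodes."""
        if not self._points:
            return []
        start = bisect.bisect(self._points, ring_position(key))
        nodes = []
        for offset in range(len(self._points)):
            point = self._points[(start + offset) % len(self._points)]
            node = self._owner[point]
            if node not in nodes:
                nodes.append(node)
                if len(nodes) == self.replicas:
                    break
        return nodes


# Hypothetical four-node cluster: each key is stored on three distinct nodes.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
replicas_for_key = ring.preference_list("user:42")
```

The point of the design: when a node leaves, only the keys it owned move to a neighbour on the ring; every other key keeps its primary node, which is what makes adding and removing commodity machines cheap.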
Exponential growth of capacities
CPU, memory, network bandwidth, storage … all of them followed Moore’s law
We can store 100,000 times more data, but it takes 1,000 times longer to read it!
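The arithmetic behind these two figures can be made explicit. The sketch below only uses the two factors quoted on the slide; the drive size and speed in the second half are illustrative assumptions, not numbers from the deck.

```python
# Figures taken from the slide
capacity_growth = 100_000   # we can store 100,000 times more data
read_time_growth = 1_000    # reading it all takes 1,000 times longer

# Full-read time = capacity / throughput, so the implied throughput growth is:
throughput_growth = capacity_growth / read_time_growth   # 100.0


def full_scan_seconds(capacity_bytes, bytes_per_second, disks=1):
    """Time to read everything when the data is spread evenly over `disks`
    drives that are read in parallel."""
    return capacity_bytes / (bytes_per_second * disks)


# Hypothetical drive: 1 TB read at 100 MB/s (illustrative numbers only)
one_disk = full_scan_seconds(1e12, 100e6)             # 10,000 s, close to 3 hours
hundred_disks = full_scan_seconds(1e12, 100e6, 100)   # 100 s
```

Read throughput grew about 100 times while capacity grew 100,000 times, so the only way to scan everything in reasonable time is to read many disks in parallel, which is the motivation for MapReduce.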
Google paper : Map Reduce
Monitoring and Management
Overview of Hadoop architecture
A new way of doing BI and data analytics
Consider that all the data is valuable, and store everything:
structured and unstructured data
Scale to petabytes of storage, at a low cost
- Yahoo has a cluster of 42,000 nodes
Don’t force the data to match a predefined data model (tables
and schema), instead use a “schema-on-read” approach
Don’t move the data (ETL) to process it, instead move the
processing to the data (Map-Reduce)
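The two ideas above, schema-on-read and moving the processing to the data, can be illustrated with a single-process imitation of a MapReduce job. This is a sketch in plain Python, not the Hadoop API: the log format, `RAW_LINES` and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Raw lines are stored as they arrive, with no schema enforced at write time.
RAW_LINES = [
    "2013-05-01 GET /home 200",
    "2013-05-01 GET /cart 500",
    "not a log line at all",
    "2013-05-02 POST /cart 200",
]


def map_line(line):
    """Schema-on-read: interpret the raw line only now, skip what doesn't fit."""
    fields = line.split()
    if len(fields) == 4 and fields[3].isdigit():
        yield fields[3], 1          # emit (HTTP status, 1)


def shuffle(pairs):
    """Group the values emitted by the mappers by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


status_counts = run_job(RAW_LINES)   # {'200': 2, '500': 1}
```

On a real cluster the map function runs on the nodes that already hold each block of data, and only the small (key, value) pairs travel over the network to the reducers.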
Build vs. Buy
a competitive advantage
Common to all companies in a sector
Perceived as an advantage for
Common to all companies
Perceived as a resource
They use and contribute massively to open source
Facebook : MySQL, Cassandra, Thrift, Open Compute (open
source hardware and datacenter design)…
Google : Android, GWT, Chromium, Linux kernel…
- through their papers: GFS, MapReduce
LinkedIn : Voldemort, Kafka, Zoie …
Netflix : a huge list of software…
I trust software I hacked myself
A way to expose services of
applications, to be re-used by
others to build and enrich their
own services and applications
Be a platform from the beginning
Jeff Bezos's memo (2002)
1) All teams will expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no
direct linking, no direct reads of another team’s data store, no
shared-memory model, no back-doors whatsoever. The only communication
allowed is via service interface calls over the network.
4) It doesn’t matter what technology they use. HTTP, Corba, Pubsub,
custom protocols — doesn’t matter. Bezos doesn’t care.
5) All service interfaces, without exception, must be designed from the
ground up to be externalizable. That is to say, the team must plan and
design to be able to expose the interface to developers in the outside world.
6) Anyone who doesn’t do this will be fired.
7) Thank you; have a nice day!
Open API: why do it
- Enrich your service portfolio and business opportunities with many
Do bigger things by using « collective intelligence of the world »
Create an ecosystem around you
Improve the quality
- If you want your APIs to be used,
- Companies around the world are looking at what you are doing → it
puts pressure on you to improve
Attract talented people
- The best way to attract good developers: they will want to come
and work with those who created these APIs
We try things. We celebrate our failures.
This is a company where it is absolutely OK
to try something that is very hard, have it not be
successful, take the learning and apply it to
former CEO of Google
Move fast and break things
Failure is totally OK.
As long as you fail fast
The minimum viable product
is that version of a new product
which allows a team to collect the
maximum amount of validated
learning about customers with the
pioneer of Lean Startup
Infra as Code: industrialize and automate everything
- Test-driven infrastructure!
Continuous Delivery : a pipeline to bring code to production
Tools and practices
- Continuous integration
- TDD - Test Driven Development
(automated unit testing)
- Code reviews
- Continuous code auditing (Sonar…)
- Functional test automation
- Strong non-functional tests
- Automated packaging and deployment,
independent of target environment
- Zero downtime deployment
Push code to production != push a feature to production
Enable/disable a new feature in production in seconds
“Graceful degradation” during peaks of traffic
Can be used for A/B testing!
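A feature flag of this kind can be sketched as follows. This is a minimal in-memory illustration, not any particular company's tool: the class `FeatureFlags`, its methods and the feature name `new-chat` are hypothetical, and a real system would read the rollout table from a shared configuration store so that a change reaches every server in seconds.

```python
import hashlib


class FeatureFlags:
    """In-memory feature flags with percentage rollout (illustrative only)."""

    def __init__(self):
        self._rollout = {}   # feature name -> percentage of users (0..100)

    def set_rollout(self, feature, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def is_enabled(self, feature, user_id):
        """Deterministic: the same user always lands in the same bucket."""
        percent = self._rollout.get(feature, 0)   # unknown features are off
        digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


flags = FeatureFlags()
flags.set_rollout("new-chat", 10)    # A/B test: roughly 10% of users get the feature


def render_home(user_id):
    # The code for the feature is already deployed; the flag decides who sees it.
    if flags.is_enabled("new-chat", user_id):
        return "home page with chat"
    return "home page"
```

Setting the rollout to 100 launches the feature without a deployment, and setting it back to 0 during a traffic peak is one simple form of graceful degradation.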
Datamodel evolution strategy example
V.1 + V.2
Dark Launch @ Facebook
We chose to simulate the impact of
many real users hitting many machines by
means of a “dark launch” period in which
Facebook pages would make connections to
the chat servers, query for presence
information and simulate message sends
without a single UI element drawn
on the page.
IT’S A LOAD TEST ON A PRODUCTION PLATFORM !
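The dark launch described in the quote can be sketched like this: the existing page is served as usual, while a hidden call exercises the new backend and its result is thrown away. The names `DarkLaunch`, `shadow` and `render_page` are hypothetical, and the call is made synchronously here only to keep the sketch short; in practice it would be asynchronous so that the page is never slowed down.

```python
import time


class DarkLaunch:
    """Send real traffic to a new backend without showing anything to users."""

    def __init__(self, new_backend, enabled=True):
        self.new_backend = new_backend   # callable for the system under test
        self.enabled = enabled
        self.calls = 0
        self.errors = 0
        self.latencies = []              # seconds, one entry per shadow call

    def shadow(self, *args, **kwargs):
        if not self.enabled:
            return
        started = time.perf_counter()
        self.calls += 1
        try:
            self.new_backend(*args, **kwargs)   # result is deliberately ignored
        except Exception:
            self.errors += 1                    # a failure must never reach the user
        finally:
            self.latencies.append(time.perf_counter() - started)


def render_page(user_id, dark):
    dark.shadow(user_id)                        # invisible load on the new system
    return f"<html>page for {user_id}</html>"   # the page itself is unchanged
```

The counters collected in `calls`, `errors` and `latencies` are what turn production traffic into a load test before any UI element is drawn.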
Hystrix is a latency and fault tolerance library
designed to isolate points of access to remote
systems, services and 3rd party libraries, stop
cascading failure and enable resilience in complex
distributed systems where failure is inevitable.
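The core pattern behind this kind of library is the circuit breaker. The sketch below shows the idea in Python; it is not the Hystrix API (Hystrix is a Java library), and the class `CircuitBreaker` with its `max_failures`, `reset_after` and `clock` parameters is an assumption made for the example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing remote service and answer with a fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()          # open: do not touch the remote system
            # Otherwise half-open: let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None             # success closes the circuit
        return result
```

A caller wraps each remote access, for example `breaker.call(fetch_recommendations, lambda: [])` with a hypothetical `fetch_recommendations` function, so that a slow or dead dependency degrades one feature instead of cascading through the whole system.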
In God we trust.
All others must bring data
W. Edwards Deming
Everyone must be able to experiment, learn and iterate.
Position, obedience and tradition should hold no power.
For innovation to flourish, measurement must rule.
Werner Vogels,
CTO of Amazon
They measure everything
infrastructure, from datacenter to HDD power consumption
operational processes efficiency
self-service restaurant queue length !
management practices (Google)
Good ideas come from the field, from real data, because
managers always have biases when they try to interpret
Best size for teams