SlideShare a Scribd company logo
What does it take
to make Google
work at scale?
Malte Schwarzkopf
Cambridge Systems at Scale
http://camsas.org – @CamSysAtScale
CamSa
S
CamSa
S
2http://camsas.org @CamSysAtScale
CamSa
S
3http://camsas.org @CamSysAtScale
CamSa
S
4http://camsas.org @CamSysAtScale
What happens here?
CamSa
S
5http://camsas.org @CamSysAtScale
What happens in those 139ms?
Internet
Google
datacenter
Your
computer
Front-end web
server
G+ idx
places
idx
web idx
? ?
?
? ?
?
CamSa
S
6http://camsas.org @CamSysAtScale
What we’ll chat about
1.Datacenter hardware
2.Scalable software stacks
a.Google
b.Facebook
3.Differences compared to HPC
4.Current research directions
Please ask questions - any
time! :)
CamSa
S
7http://camsas.org @CamSysAtScale
CamSa
S
8http://camsas.org @CamSysAtScale
CamSa
S
9http://camsas.org @CamSysAtScale
CamSa
S
10http://camsas.org @CamSysAtScale
CamSa
S
11http://camsas.org @CamSysAtScale
10,000+ machines
“Machine”
no chassis!
DC battery
~6 disk drives
mostly custom-made
Network
ToR switch
multi-path core
e.g., 3-stage fat tree or
CamSa
S
12http://camsas.org @CamSysAtScale
How’s this different to HPC?
Emphasis on commodity hardware
No expensive interconnect
Mid-range machines
Energy/performance/cost trade-off essential
Massive automation
Very small number of on-site staff
Automated software bootstrap
Fault tolerant design
Each component can fail
Software must be aware and compensate
CamSa
S
13http://camsas.org @CamSysAtScale
The software side
Machine MachineMachine
Linux kernel
(customized)
Linux kernel
(customized)
Linux kernel
(customized)
Transparent distributed systems
Containers
Containers
We’ll look at what goes here!
CamSa
S
14http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
CamSa
S
15http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
CamSa
S
16http://camsas.org @CamSysAtScale
GFS/Colossus
Bulk data block storage system
Optimized for large files (GB-size)
Supports small files, but not common case
Read, write, append modes
Colossus = GFSv2, adds some improvements
e.g., Reed-Solomon-based erasure coding
better support for latency-sensitive applications
sharded meta-data layer, rather than single master
CamSa
S
17http://camsas.org @CamSysAtScale
Application
GFS/Colossus: architecture
Chunkserver
GFS library
Chunkserver
ChunkserverGFS Master
Namespace
Metadata
CamSa
S
18http://camsas.org @CamSysAtScale
GFS/Colossus: writes
Application
GFS Master
Chunkserver
GFS library
Chunkserver
Chunkserver
Namespace
Metadata
Chain replication
CamSa
S
19http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
CamSa
S
20http://camsas.org @CamSysAtScale
BigTable (2006)
• ‘Three-dimensional‘ key-value store:
– <row key, column key, timestamp> → value
• Effectively a distributed, sorted, sparse map
row
(key: string)
column
(key: string)
timestamp
(key: int64)
cell
<“larry.page“, “websearch“, 133746428> → “cat pictures“
CamSa
S
21http://camsas.org @CamSysAtScale
BigTable (2006)
• Distributed tablets (~1 GB) hold subsets of map
• On top of Colossus, which handles replication and
fault tolerance: only one (active) server per tablet!
• Reads & writes within a row are transactional
– Independently of the number of columns touched
– But: no cross-row transactions possible
– Turns out users find this hard to deal with
CamSa
S
22http://camsas.org @CamSysAtScale
BigTable (2006)
• META0 tablet is “root“ for name resolution
– Like a coordinator/leader
– Nested META1 tablets improve meta-data lookup
• Elect master (META0 tablet server) using
Chubby, and to maintain list of tablet servers &
schemas
– 5-way replicated Paxos consensus on data in Chubby
• Built atop GFS; building block for MegaStore
– “Next gen” with transactions: Spanner
CamSa
S
23http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
CamSa
S
24http://camsas.org @CamSysAtScale
Spanner (2012)
• BigTable insufficient for some consistency needs
• Often have transactions across >1 data centres
– May buy app on Play Store while travelling in the U.S.
– Hit U.S. server, but customer billing data is in U.K.
– Or may need to update several replicas for fault tolerance
• Wide-area consistency is hard
– due to long delays and clock skew
– no global, universal notion of time
– NTP not accurate enough, PTP doesn’t work (jittery links)
CamSa
S
25http://camsas.org @CamSysAtScale
Spanner (2012)
• Spanner offers transactional consistency: full RDBMS
power, ACID properties, at global scale!
• Secret sauce: hardware-assisted clock sync
– Using GPS and atomic clocks in data centres
• Use global timestamps and Paxos to reach consensus
– Still have a period of uncertainty for write TX: wait it out!
– Each timestamp is an interval:
Definitely in
the future
tt.latest
Definitely in
the past
tt.earliest
tabs
CamSa
S
26http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
CamSa
S
27http://camsas.org @CamSysAtScale
MapReduce (2004)
• Parallel programming framework for scale
– Run a program on 100’s to 10,000’s machines
• Framework takes care of:
– Parallelization, distribution, load-balancing, scaling up
(or down) & fault-tolerance
• Accessible: programmer provides two methods ;-)
– map(key, value) → list of <key’, value’> pairs
– reduce(key’, value’) → result
– Inspired by functional programming
CamSa
S
28http://camsas.org @CamSysAtScale
MapReduce
Input
Map
Reduce
Output
Shuffle
X: 5 X: 3 Y: 1 Y: 7
Y: 8X: 8
Results: X: 8, Y: 8
Perform Map() query against local data
matching input specification
Aggregate gathered results for each
intermediate key using Reduce()
End user can query results via
distributed key/value store
Slide originally due to S. Hand’s distributed systems lecture course at Cambridge:
http://www.cl.cam.ac.uk/teaching/1112/ConcDisSys/DistributedSystems-1B-H4.pdf
CamSa
S
29http://camsas.org @CamSysAtScale
MapReduce: Pros & Cons
• Extremely simple, and:
– Can auto-parallelize (since operations on every
element in input are independent)
– Can auto-distribute (since rely on underlying
Colossus/BigTable distributed storage)
– Gets fault-tolerance (since tasks are idempotent, i.e.
can just re-execute if a machine crashes)
• Doesn’t really use any sophisticated distributed
systems algorithms (except storage replication)
• However, not a panacea:
– Limited to batch jobs, and computations which are
expressible as a map() followed by a reduce()
CamSa
S
30http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
CamSa
S
31http://camsas.org @CamSysAtScale
Dremel (2010)
Column-oriented store
For quick, interactive queries
...
Row-oriented storage
Column-oriented storage
CamSa
S
32http://camsas.org @CamSysAtScale
Dremel (2010)
Stores protocol buffers
Google’s universal serialization format
Nested messages → nested columns
Repeated fields → repeated records
Efficient encoding
Many sparse records: don’t store NULL fields
Record re-assembly
Need to put results back together into records
Use a Finite State Machine (FSM) defined by
protocol buffer structure
CamSa
S
33http://camsas.org @CamSysAtScale
The Google Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
CamSa
S
34http://camsas.org @CamSysAtScale
Borg/Omega
Cluster manager and scheduler
Tracks machine and task liveness
Decides where to run what
Consolidates workloads onto machines
Efficiency gain, cost savings
Need fewer clusters
CamSa
S
35http://camsas.org @CamSysAtScale
BorgMaster
Borg/Omega
BorgMasterBorgMaster
Link shard
Borglet Borglet Borglet
Paxos-replicated
persistent store
Scheduler
Figure reproduced after A. Verma et al., “Large-scale cluster management at
Google with Borg”, Proceedings of EuroSys 2015.
CamSa
S
36http://camsas.org @CamSysAtScale
Borg/Omega: workloads
Jobs/tasks: counts
CPU/RAM: resource seconds [i.e. resource * job runtime in sec.]
Cluster A
Medium size
Medium utilization
Cluster B
Large size
Medium utilization
Cluster C
Medium (12k mach.)
High utilization
Public trace
Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large
compute clusters”, Proceedings of EuroSys 2013.
CamSa
S
37http://camsas.org @CamSysAtScale
Borg/Omega: workloads
FractionofjobsrunningforlessthanX
Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large
compute clusters”, Proceedings of EuroSys 2013.
Service jobs run for much longer than batch jobs:
long-term user-facing services vs. one-off analytics.
CamSa
S
38http://camsas.org @CamSysAtScale
Borg/Omega: workloads
Fractionofinter-arrivalgapslessthanX
Figure from M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large
compute clusters”, Proceedings of EuroSys 2013.
Batch jobs arrive more frequently than service jobs:
more numerous, shorter duration, fail more.
CamSa
S
39http://camsas.org @CamSysAtScale
Borg/Omega: workloads
Batch jobs have a longer-tailed CPI distribution:
lower scheduling priority in kernel scheduler.
Figures from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
CamSa
S
40http://camsas.org @CamSysAtScale
Borg/Omega: workloads
Service workloads access memory more frequently:
larger working sets, less I/O.
Figures from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
CamSa
S
41http://camsas.org @CamSysAtScale
How’s this different to HPC?
Generally avoid tight synchronization
No barrier-syncs!
Asynchronous replication, eventual consistency
Much more data-intensive
Lower compute-to-data ratio
Aggressive resource sharing for utilization
CamSa
S
42http://camsas.org @CamSysAtScale
The facebook Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
CamSa
S
43http://camsas.org @CamSysAtScale
The facebook Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
CamSa
S
44http://camsas.org @CamSysAtScale
The facebook Stack
Figure from M. Schwarzkopf, “Operating system support for warehouse-scale
computing”, PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
CamSa
S
45http://camsas.org @CamSysAtScale
Haystack & f4
Blob stores, hold photos, videos
not: status updates, messages, like counts
Items have a level of hotness
How many users are currently accessing this?
Baseline “cold” storage: MySQL
Want to cache close to users
Reduces network traffic
Reduces latency
But cache capacity is limited!
Replicate for performance, not resilience
CamSa
S
46http://camsas.org @CamSysAtScale
What about
other companies’ stacks?
CamSa
S
47http://camsas.org @CamSysAtScale
How about other companies?
Very similar stacks.
Microsoft, Yahoo, Twitter all similar in principle.
Typical set-up:
Front-end serving systems and fast back-ends.
Batch data processing systems.
Multi-tier structured/unstructured storage hierarchy.
Coordination system and cluster scheduler.
Minor differences owed to business focus
e.g., Amazon focused on inventory/shopping cart.
CamSa
S
48http://camsas.org @CamSysAtScale
Open source software
Lots of open reimplementations!
MapReduce → Hadoop, Spark, Metis
GFS → HDFS
BigTable → HBase, Cassandra
Borg/Omega → Mesos, Firmament
But also some releases from companies…
Presto (Facebook)
Kubernetes (Google)
CamSa
S
49http://camsas.org @CamSysAtScale
Current research
Many hot topics!
This distributed infrastructure is still new.
Make many-core machines go further
Increase utilization, schedule better.
Deal with co-location interference.
Rapid insights on huge datasets
Fast, incremental stream processing, e.g. Naiad.
Approximate analytics, e.g. BlinkDB.
Distributed algorithms
New consensus systems, e.g. RAFT.
CamSa
S
50http://camsas.org @CamSysAtScale
Conclusions
1.Running at huge (10k+ machines) scale
requires different software stacks.
2.Pretty interesting systems
a.Do read the papers!
b.(Partial) Bibliography:
http://malteschwarzkopf.de/research/assets/google-stack.pdf
http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
1.Lots of activity in this area!
a.Open source software
b.Research initiatives
Cambridge
Systems at Scale
@CamSysAtScale
camsas.org

More Related Content

What's hot

Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
Data Con LA
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
rhatr
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
 
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetupSteve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
bigdatalondon
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
Hadoop User Group
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
liqiang xu
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Hbase jdd
Hbase jddHbase jdd
Hbase jdd
Andrzej Grzesik
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Databricks
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
March 2011 HUG: Scaling Hadoop
March 2011 HUG: Scaling HadoopMarch 2011 HUG: Scaling Hadoop
March 2011 HUG: Scaling Hadoop
Yahoo Developer Network
 
Voldemort
VoldemortVoldemort
Voldemort
fasiha ikram
 
Datacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DCDatacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DC
Paco Nathan
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
Harika583
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon
 

What's hot (20)

Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetupSteve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a Flurry
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Hbase jdd
Hbase jddHbase jdd
Hbase jdd
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
March 2011 HUG: Scaling Hadoop
March 2011 HUG: Scaling HadoopMarch 2011 HUG: Scaling Hadoop
March 2011 HUG: Scaling Hadoop
 
Voldemort
VoldemortVoldemort
Voldemort
 
Datacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DCDatacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DC
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraph
 

Viewers also liked

140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh
Data Con LA
 
Hadoop, hive和scribe在运维方面的应用
Hadoop, hive和scribe在运维方面的应用Hadoop, hive和scribe在运维方面的应用
Hadoop, hive和scribe在运维方面的应用
xshadowxc
 
Hic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookHic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebook
baggioss
 
Realtime statistics using Java, Kafka and Graphite
Realtime statistics using Java, Kafka and GraphiteRealtime statistics using Java, Kafka and Graphite
Realtime statistics using Java, Kafka and Graphite
Hung Nguyen
 
Qcon
QconQcon
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
Samuel Rash
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
Treasure Data, Inc.
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
royans
 

Viewers also liked (8)

140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh
 
Hadoop, hive和scribe在运维方面的应用
Hadoop, hive和scribe在运维方面的应用Hadoop, hive和scribe在运维方面的应用
Hadoop, hive和scribe在运维方面的应用
 
Hic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookHic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebook
 
Realtime statistics using Java, Kafka and Graphite
Realtime statistics using Java, Kafka and GraphiteRealtime statistics using Java, Kafka and Graphite
Realtime statistics using Java, Kafka and Graphite
 
Qcon
QconQcon
Qcon
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 

Similar to What does it take to make google work at scale

Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
DataStax Academy
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
Matt Massie
 
Big Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the CloudBig Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the Cloud
George Ang
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
Arunit Gupta
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
Narendranath Reddy T
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
Anton Nazaruk
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
Joe Stein
 
From PoCs to Production
From PoCs to ProductionFrom PoCs to Production
From PoCs to Production
DataStax
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
Ian Pointer
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
Amazon Web Services
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
DataWorks Summit/Hadoop Summit
 
Introduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhgIntroduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhg
zznate
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
Cliff Gilmore
 
Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)
zznate
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
Guang Xu
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
Asis Mohanty
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
zznate
 
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalGPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
ScyllaDB
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
Amazon Web Services
 

Similar to What does it take to make google work at scale (20)

Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Big Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the CloudBig Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the Cloud
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
From PoCs to Production
From PoCs to ProductionFrom PoCs to Production
From PoCs to Production
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Introduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhgIntroduction to apache_cassandra_for_developers-lhg
Introduction to apache_cassandra_for_developers-lhg
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
 
Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalGPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
 

What does it take to make google work at scale