The document summarizes a 2009 talk on Google's mission to organize the world's information and make it universally accessible and useful. It traces Google's history from its earliest storage systems, including a disk case built from Lego bricks, to its modern large-scale data centers, and describes the ever-increasing data and computation demands of Google's services. It outlines how Google plans for failure and for expansion across applications, infrastructure, and hardware. Key systems Google developed to manage data and computing at this scale include the Google File System (GFS), MapReduce, and BigTable.
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012 (Dipti Borkar)
For more deep NoSQL content from Couchbase, check out http://www.couchbase.com/webinars
NoSQL databases have emerged as a better match than relational systems for modern interactive applications, offering cost-effective data management at “Big Data” scale. But there are significant differences between structured and schema-less database technology. What should architects and technical managers know as they explore NoSQL solutions for their teams?
In this workshop you will learn:
- How to evaluate NoSQL (both technical advantages and limitations) as a potential data management approach
- Critical differences between NoSQL and RDBMS for designing, building and running production applications
- Ideal use cases for NoSQL technology and sample reference architectures
Hadoop has proven to be an invaluable tool for many companies over the past few years. Yet it has its quirks, and knowing them up front can save valuable time. This session is a rundown of the recurring lessons learned from running various Hadoop clusters in production since version 0.15.
What to expect from Hadoop - and what not? How to integrate Hadoop into existing infrastructure? Which data formats to use? What compression? Small files vs big files? Append or not? Essential configuration and operations tips. What about querying all the data? The project, the community and pointers to interesting projects that complement the Hadoop experience.
Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
These are the slides from my presentation on Running R in the Database using Oracle R Enterprise. The second half of the presentation was a live demo of Oracle R Enterprise; unfortunately, the demo is not included in these slides.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
How to bootstrap an SRE team into your company: how to hire them, what to have them work on, and how to interact with them as a team. Finally, some thoughts on general practices to consider before your SREs arrive. There are also kitten pictures.
Site Reliability Engineering enables agility and stability.
SREs use software engineering to automate themselves out of the job.
My advice, if you want to implement this change in your company, is to start with action items: alter your training and hiring, implement error budgets, do blameless postmortems, and reduce toil.
The Social Requirements Engineering (SRE) Approach to Developing a Large-scale Personal Learning Environment Infrastructure (Ralf Klamma)
Effie Lai-Chong Law, Arunangsu Chatterjee, Dominik Renzel and Ralf Klamma
Department of Computer Science, University of Leicester, UK
Chair of Computer Science 5 - Information Systems, RWTH Aachen University, Germany
EC-TEL 2012, Saarbrücken, Germany
September 21, 2012
I'm No Hero: Full Stack Reliability at LinkedIn (Todd Palino)
The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.
At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.
Description:
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Overview of Kafka: how it works, the components of Kafka, and use cases.
Kafka at LinkedIn. Download the slides to see animations explaining how the components fit.
This was presented at the Kafka meetup held on June 11, 2016 at the LinkedIn Bangalore office.
Stephen McHenry - Chancellor of Site Reliability Engineering, Google
1. Woulda, Coulda, Shoulda
The World of Tera, Peta & Exa
Stephen McHenry
Chancellor of Site Reliability Engineering
April 22, 2009
2. Overview
• Mission Statement
• Some History
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
3. Google’s Mission
To organize the world’s information
and make it universally
accessible and useful
4. Overview
• Mission Statement
• Some History
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
5. Lego Disk Case
One of our earliest storage systems
13. Current Data Center
14. Overview
• Mission Statement
• Some History
• The Challenge
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
16. How much information is out there?
How large is the Web?
• Tens of billions of documents? Hundreds of billions?
• ~10KB/doc => 100s of Terabytes
Then there's everything else:
• Email, personal files, closed databases, broadcast media, print, etc.
Estimated 5 Exabytes/year (growing at 30%)*
• 800MB/year/person, ~90% of it in magnetic media
The Web is just a tiny starting point
* Source: How Much Information? 2003 (UC Berkeley)
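As a sanity check on the slide's arithmetic, here is a quick back-of-envelope calculation; the midpoint figure is an illustrative assumption, since the slide only gives orders of magnitude:

```python
# Back-of-envelope check: tens of billions of documents at ~10 KB each
# lands in the hundreds-of-terabytes range, as the slide claims.
docs = 50e9            # "tens of billions" of documents (illustrative midpoint)
bytes_per_doc = 10e3   # ~10 KB per document
total_bytes = docs * bytes_per_doc
print(f"{total_bytes / 1e12:.0f} TB")  # -> 500 TB, i.e. "100s of Terabytes"
```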
17. Google takes its mission seriously
Started with the Web (html)
Added various document formats
• Images
• Commercial data: ads and shopping (Froogle)
• Enterprise (corporate data)
• News
• Email (Gmail)
• Scholarly publications
• Local information
• Maps
• Yellow pages
• Satellite images
• Instant messaging and VoIP
• Communities (Orkut)
• Printed media
• …
18. Ever-Increasing Computation Needs
Every Google service sees continuing growth in computational needs:
• More queries: more users, and happier users, issue more queries
• More data: a bigger web, bigger mailboxes, more blogs, etc.
• Better results: find the right information, and find it faster
[Diagram: a self-reinforcing cycle - more queries, more data, better results]
19. Overview
• Mission Statement
• Some History
• The Challenge
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
20. When Your Data Center Reaches 170°F
21. The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
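To see why software must treat failure as routine, here is a hedged back-of-envelope reading of those numbers; the cluster size and the independence assumption are simplifications, not figures from the talk:

```python
import math

machines = 1000                    # a typical cluster size implied by the slide
machine_failures_per_year = 1000   # "~1000 individual machine failures"

failures_per_day = machine_failures_per_year / 365
print(f"~{failures_per_day:.1f} machine failures per day")  # ~2.7/day

# Modeling failures as independent and uniform in time (a simplification),
# the chance that at least one machine fails in any given hour:
rate_per_hour = machine_failures_per_year / (365 * 24)
p_any_failure = 1 - math.exp(-rate_per_hour)
print(f"~{p_any_failure:.0%} chance of a failure in any given hour")  # ~11%
```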
22. Overview
• Mission Statement
• Some History
• The Challenge
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
23. Components of Web Search
Crawler (Spider): collects the documents
• Crawling process: get a link from the list of links to explore, fetch the page, parse the page to extract links, add URLs to the queue; expired pages are re-fetched from the index
• Tradeoff between size and speed
• High networking bandwidth requirements
• Be gentle to serving hosts while doing it
Indexer: generates the index - similar to the back of a book (but big!)
• Requires several days on thousands of computers
• More than 20 billion web documents (Web, Images, News, Usenet messages, …)
• Pre-compute query-independent ranking (PageRank, etc.)
Query serving: processes user queries
• Finding all relevant documents: search over tens of Terabytes, 1000s of times/second
• Scoring: a mix of query-dependent and query-independent factors
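The indexer's core data structure is an inverted index: a map from each term to the documents containing it, with query serving reduced to intersecting posting lists. A minimal sketch of the idea (not Google's implementation; document IDs and tokenization are deliberately simplified):

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """AND-query: intersect the posting lists of all query terms."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "cheap tires online", 2: "tires and wheels", 3: "cheap restaurants"}
idx = build_index(docs)
print(search(idx, "cheap tires"))  # -> {1}
```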
24. Google Query Serving Infrastructure
[Diagram: a query enters the Google Web Server, which consults misc. servers (spell checker, ad server) and fans out to index servers and doc servers; index shards I0…IN and doc shards D0…DM are each replicated across many rows of machines]
Elapsed time: 0.25s, machines involved: 1000+
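The diagram describes a scatter-gather pattern: the web server fans a query out to every shard, falls back to a replica when a row is down, and merges the partial results. A minimal sketch of that pattern (hypothetical shard and replica objects, not the actual serving code):

```python
# Scatter-gather over sharded, replicated backends: query every shard,
# trying replicas in order until one answers, then merge partial results.
def query_shard(replicas, query):
    for replica in replicas:          # replicas of one shard, in preference order
        try:
            return replica(query)     # each replica is a plain callable here
        except ConnectionError:
            continue                  # this replica is down; try the next row
    return []                         # whole shard unavailable: degrade gracefully

def scatter_gather(shards, query):
    partials = [query_shard(replicas, query) for replicas in shards]
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)  # sort by score

# Toy setup: two shards, a "hit" is (doc_id, score).
shards = [
    [lambda q: [("doc-a", 0.9)], lambda q: [("doc-a", 0.9)]],
    [lambda q: [("doc-b", 0.7), ("doc-c", 0.4)]],
]
print(scatter_gather(shards, "tires"))
# -> [('doc-a', 0.9), ('doc-b', 0.7), ('doc-c', 0.4)]
```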
25. Ads System
As challenging as search, but with some transactional semantics
Problem: find useful ads based on what the user is interested in at that moment
• A form of mind reading
Two systems:
• Ads for search results pages (search for tires or restaurants)
• Ads for web browsing/email (or 'content ads')
Extract a contextual meaning from web pages
Do the same thing for data from a gazillion advertisers
Match those up and score them
Do it faster than the original content provider can respond to the web page!
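One simple way to picture "extract a contextual meaning and match it against advertiser data" is keyword-overlap scoring between page terms and ad keywords. A toy sketch of the matching step only; the real system is far more sophisticated, and the ad names here are invented:

```python
def score_ad(page_terms: set[str], ad_keywords: set[str]) -> float:
    """Jaccard-style overlap between page context and ad keywords."""
    if not ad_keywords:
        return 0.0
    return len(page_terms & ad_keywords) / len(page_terms | ad_keywords)

page = {"cheap", "tires", "winter", "driving"}
ads = {
    "TireMart":  {"tires", "cheap", "wheels"},
    "SoupPlace": {"restaurants", "soup"},
}
ranked = sorted(ads, key=lambda name: score_ad(page, ads[name]), reverse=True)
print(ranked[0])  # -> TireMart
```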
27. Language Translation (by Machine)
Information is more useful if more people can understand it
Translation is a long-standing, challenging Artificial Intelligence problem
Key insight:
• Transform it into a statistical modeling problem
• Train it with tons of data!
Doubling the training corpus size yields a ~0.5% higher score (Chinese-English and Arabic-English)
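The "statistical modeling" insight can be pictured with the simplest possible estimator: translation probabilities read off as relative co-occurrence counts in a parallel corpus, which is why more training data sharpens the model. A toy sketch under that assumption; real systems use far richer models, alignment, and smoothing:

```python
from collections import Counter

# Tiny "parallel corpus" of aligned (source_word, target_word) pairs.
aligned_pairs = [("maison", "house"), ("maison", "house"),
                 ("maison", "home"), ("chat", "cat")]

pair_counts = Counter(aligned_pairs)
source_counts = Counter(src for src, _ in aligned_pairs)

def p_translation(src: str, tgt: str) -> float:
    """Relative-frequency estimate p(tgt | src) = count(src, tgt) / count(src)."""
    return pair_counts[(src, tgt)] / source_counts[src]

print(p_translation("maison", "house"))  # ~0.67; more data sharpens this estimate
```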
28. Data + CPUs = Playground
Substantial fraction of internet available for processing
Easy-to-use teraflops/petabytes
Cool problems, great fun…
29. Learning From Data
Searching for Britney Spears…
30. Query Frequency Over Time
[Charts: query volume over time for queries containing “eclipse”, “world series”, “full moon”, “summer olympics”, “watermelon”, and “opteron”]
32. A Simple Challenge For Our Computing Platform
1. Create the world's largest computing infrastructure
2. Make sure we can afford it
Need to drive efficiency of the computing infrastructure to unprecedented levels:
• indices containing more documents
• updated more often
• faster queries
• faster product development cycles
• …
33. Overview
• Mission Statement
• Some History
• The Challenge
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
35. GFS: Google File System
Planning for unprecedented quantities of data storage, and for failure(s)
Google has unique FS requirements:
• Huge read/write bandwidth
• Reliability over thousands of nodes
• Mostly operating on large data blocks
• Need efficient distributed operations
GFS usage @ Google:
• Many clusters
• Filesystem clusters of up to 5000+ machines
• Pools of 10000+ clients
• 5+ PB filesystems
• 40 GB/s read/write load in a single cluster
• (in the presence of frequent HW failures)
36. GFS Setup
[Diagram: multiple clients issue requests to replicated GFS masters; chunks (C0, C1, C2, C3, C5, …) are stored and replicated across Machine 1 … Machine N, alongside misc. servers]
• Master manages metadata
• Data transfers happen directly between clients and machines
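The key design point on this slide is that the master holds only metadata (which chunks make up a file, and where each replica lives) while bulk data flows directly between clients and chunkservers. A minimal sketch of that read path, with hypothetical class and file names rather than the actual GFS API:

```python
# Sketch of a GFS-like read path: the master answers metadata lookups only;
# the client then fetches chunk bytes directly from a chunkserver replica.
class Master:
    def __init__(self):
        # file name -> list of (chunk_id, [replica locations])
        self.metadata = {"/logs/day1": [("c0", ["machine1", "machine7"]),
                                        ("c1", ["machine2", "machine9"])]}

    def lookup(self, path):
        return self.metadata[path]

class ChunkServer:
    def __init__(self, chunks):
        self.chunks = chunks                 # chunk_id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]

def read_file(master, servers, path):
    data = b""
    for chunk_id, replicas in master.lookup(path):   # metadata via the master
        server = servers[replicas[0]]                # data path goes direct
        data += server.read(chunk_id)
    return data

servers = {"machine1": ChunkServer({"c0": b"hello "}),
           "machine2": ChunkServer({"c1": b"world"})}
print(read_file(Master(), servers, "/logs/day1"))    # b'hello world'
```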
37. MapReduce - Large Scale Processing
Okay, GFS lets us store lots of data… now what?
We need to process that data in new and interesting ways!
• Fast: locality optimization, optimized sorter, lots of tuning work done...
• Robust: handles machine failure, bad records, …
• Easy to use: little boilerplate, supports many formats, …
• Scalable: can easily add more machines to handle more data or reduce the run-time
• Widely applicable: can solve a broad range of problems
• Monitoring: status page, counters, …
The Plan - Develop a robust compute infrastructure that allows rapid development of complex analyses, and is tolerant to failure(s)
38. MapReduce - Large Scale Processing
MapReduce:
• A framework to simplify large-scale computations on large clusters
• Good for batch operations
• User writes two simple functions: map and reduce
• Underlying library/framework takes care of messy details
• Greatly simplifies large, distributed data processing
Lots of uses inside Google: Ads, Froogle, Google Earth, Google Local, Google News, Google Print, Machine Translation, Sawmill (logs analysis), Search My History, search quality, spelling, web search indexing, …many other internal projects...
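The canonical illustration of "user writes two simple functions" is word count. A minimal single-process sketch of the programming model; the real framework shards the input, runs these functions on thousands of machines, and handles the failures described above:

```python
from collections import defaultdict

def map_fn(doc: str):
    """Map: emit (word, 1) for every word in one input record."""
    for word in doc.split():
        yield word, 1

def reduce_fn(word: str, counts: list):
    """Reduce: sum all partial counts for one key."""
    return word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)                 # the "shuffle" phase
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the cat sat", "the dog sat"]
print(mapreduce(docs, map_fn, reduce_fn))
# -> {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```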
39. Large Scale Processing - (semi) Structured Data
Why not just use a commercial DB?
• Scale is too large for most commercial databases
• Even if it weren't, cost would be very high
• Building internally means the system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly; much harder to do when running on top of a database layer
Okay, traditional relational databases are woefully inadequate at this scale… now what?
The Plan - Build a large-scale, distributed solution for semi-structured data that is resistant to failure(s)
40. Large Scale Processing - (semi) Structured Data
BigTable:
• A large-scale storage system for semi-structured data
• Database-like model, but data stored on thousands of machines
• Fault-tolerant, persistent
• Scalable:
  Thousands of servers
  Terabytes of in-memory data
  Petabytes of disk-based data
  Millions of reads/writes per second, efficient scans
  Billions of URLs, many versions/page (~20K/version)
  Hundreds of millions of users, thousands of queries/sec
  100TB+ of satellite image data
• Self-managing:
  Servers can be added/removed dynamically
  Servers adjust to load imbalance
• Design/initial implementation started beginning of 2004
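BigTable's "database-like model" is essentially a sorted, multi-dimensional map: (row, column, timestamp) -> value. A minimal in-memory sketch of that data model only; the real system distributes sorted row ranges across thousands of servers, and the example row keys are illustrative:

```python
import time

class TinyBigTable:
    """Sorted map of (row, column, timestamp) -> value, newest version first."""
    def __init__(self):
        self.cells = {}                         # (row, column) -> [(ts, value), ...]

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(reverse=True)             # keep the newest version first

    def get(self, row, column):
        versions = self.cells.get((row, column), [])
        return versions[0][1] if versions else None

    def scan(self, row_prefix):
        """Efficient prefix scans are the point of keeping rows sorted."""
        for (row, column) in sorted(self.cells):
            if row.startswith(row_prefix):
                yield row, column, self.get(row, column)

t = TinyBigTable()
t.put("com.example/index", "contents:", "<html>…</html>")
t.put("com.example/index", "anchor:cnn", "Example")
print(list(t.scan("com.example")))
```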
41. BigTable Usage
Useful for structured/semi-structured data:
• URLs: contents, crawl metadata, links, anchors, pagerank, …
• Per-user data: user preference settings, recent queries/search results, …
• Geographic data: physical entities, roads, satellite imagery, annotations, …
Production use or active development for ~70 projects: Google Print, My Search History, Orkut, the crawling/indexing pipeline, Google Maps/Google Earth, Blogger, …
Currently ~500 BigTable cells
Largest BigTable cell manages ~3000TB of data spread over several thousand machines (larger cells planned)
42. Overview
• Mission Statement
• Some History
• The Challenge
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
43. A Simple Challenge For Our Computing Platform
1. Create the world's largest computing infrastructure
2. Make sure we can afford it
Need to drive efficiency of the computing infrastructure to unprecedented levels:
• indices containing more documents
• updated more often
• faster queries
• faster product development cycles
• …
44. Innovative Solutions Needed In Several Areas
Server design and architecture
Power efficiency
System software
Large scale networking
Performance tuning and optimization
System management and repairs automation
45. Pictorial History
• Brainstorming circa 2003
• Container-based data centers
• Battery per server instead of a traditional UPS: 99.9% efficient backup power!
• Application of best practices leads to PUE below 1.2
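PUE (Power Usage Effectiveness) is total facility power divided by IT power, so "PUE below 1.2" means under 0.2 W of cooling and distribution overhead per watt of IT load. A quick illustration using the 10 MW IT load from the "Data Center Vitals" slide below; the typical-PUE comparison is an assumption based on commonly cited industry figures of the period, not from the talk:

```python
# PUE = total facility power / IT equipment power.
it_load_mw = 10.0        # IT load from the "Data Center Vitals" slide
pue = 1.2                # "best practices lead to PUE below 1.2"

total_mw = it_load_mw * pue
overhead_mw = total_mw - it_load_mw
print(f"total draw: {total_mw:.1f} MW, overhead: {overhead_mw:.1f} MW")
# -> total draw: 12.0 MW, overhead: 2.0 MW
# A then-typical PUE of ~2.0 would have meant ~10 MW of overhead instead.
```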
50. Data Center Vitals
• Capacity: 10 MW IT load
• Area: 75,000 sq ft total under roof
• Overall power density: 133 W/sq ft
• Prototype container delivered January 2005
• Data center built 2004-2005
• Construction completed September 2005
• Went live November 21, 2005
51. Additional Vitals
• 45 containers, approx. 40,000 servers
• Single- and 2-story on facing sides of the hangar
• Bridge crane for container handling
52. Overview
• Mission Statement
• Some History
• The Challenge
• Planning for
• Failure
• Expansion
• Applications
• Infrastructure
• Hardware
• The Future
53. Planning for the Future
• Manage Total Cost of Ownership
• Reduce Water Usage
• Reduce Power Consumption
• Manage E-Waste
54. Total Cost of Ownership - TCO
Earnings and sustainability are (often) aligned
• Careful application of best practices leads to much lower energy use, which leads to lower TCO for facilities. Examples:
  o Manage air flow: avoid hot/cold mixing
  o Raise the inlet temperature
  o Use free cooling (Belgium has no chillers!)
  o Optimize power distribution
• Don't need exotic technologies
• But: need to break down traditional silos
  o Between capex and opex
  o Between facilities and IT
  o Manage everyone by impact on TCO
55. Water resources management is the next "elephant in the room" we are all going to have to address.
56. A Great Wave Rising: The coming U.S. crisis in water policy
[Photos: Lake Powell at 53% full (from ESPN!); Shasta Lake]
57. Lake Mead water could dry up by 2021*
[Images: Lake Mead historical levels; Lake Mead at 45% full; new docks at Lake Oroville]
* Scripps Institution of Oceanography, UCSD, Feb 2008
58. Georgia’s Lake Lanier
[Photos: March 4, 2007 vs. February 11, 2008]
59. Lake Hartwell, GA - November 2008
60. Water – The Next “Big Elephant”
Why?
• Water resources are becoming (a lot) scarcer and more variable
How do data centers fit in?
• For every 10 MW consumed, the average data center uses ~150,000 gallons of water per day for cooling.
• Upstream of the data center, the same 10 MW of delivered power consumes 480,000 gallons of water per day to generate that power.
References:
• U.S. Dept. of Energy - Energy Demands on Water Resources - Dec. 2006
• National Renewable Energy Laboratory - Consumptive Water Use for U.S. Power Production - Dec. 2003
• USGS - Water Use at Home - Jan. 2009
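Those two figures fold into a single per-kWh number, which makes the "use less power" conclusion on the next slide concrete. A quick calculation from the slide's own numbers:

```python
# Water cost of a 10 MW data center, per the slide's figures.
it_load_mw = 10.0
onsite_gal_per_day = 150_000     # cooling water at the data center
upstream_gal_per_day = 480_000   # water consumed generating that power

kwh_per_day = it_load_mw * 1000 * 24   # 240,000 kWh/day
print(f"on-site:  {onsite_gal_per_day / kwh_per_day:.2f} gal/kWh")   # 0.62
print(f"upstream: {upstream_gal_per_day / kwh_per_day:.2f} gal/kWh") # 2.00
# Every kWh saved avoids ~2.6 gallons of water overall, so power
# efficiency is also water efficiency.
```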
61. Water Consumption (gpd) by DC Type
[Chart: water consumption in gallons per day by data center type]
Factoid: the typical 'water-less' DC uses about a third more water than the evaporatively cooled Google DC
Using less power is the most significant factor for reducing water consumption
62. Water Recycling: Our data center in St. Ghislain, Belgium
Google's data center in Belgium uses 100% reclaimed water from an industrial canal
63. Power - Cutting Waste / Smarter Computing
Fact: the typical PC wastes half the electricity it uses
Fact: over 60% of all corporate PCs are left on overnight
• End-user devices are the largest portion of the IT footprint
• Power efficiency is critical as billions of devices are deployed
• The technology exists today to save energy and money:
  o Buy power-efficient laptops / PCs / servers (Google saves $30 per server every year)
  o Enable power management (power management suites: ROI < 1 year)
  o Transition to lightweight devices (reduce power from 150W to less than 5W)
Potential: 50% emissions reduction
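The 150 W to 5 W transition translates directly into energy and cost. A rough illustration, assuming always-on operation and an illustrative electricity price, neither of which comes from the talk:

```python
# Savings from replacing a 150 W desktop with a ~5 W lightweight device.
watts_before, watts_after = 150, 5
hours_per_year = 24 * 365

kwh_saved = (watts_before - watts_after) * hours_per_year / 1000
price_per_kwh = 0.10          # illustrative assumption, not from the talk
print(f"{kwh_saved:.0f} kWh/year saved, ~${kwh_saved * price_per_kwh:.0f}/year")
# -> 1270 kWh/year saved, ~$127/year per device
```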
64. E-waste is a Growing Problem
• Hazardous
• High volume because of obsolescence
• Ubiquitous (computers, appliances, consumer electronics, cell phones)
Solutions:
• The 4 R's: reduce, reuse, repair, recycle
• Dispose of the remainder responsibly
65. Thank you!