The document summarizes the key components of the big data stack, from the presentation layer where users interact, through various processing and storage layers, down to the physical infrastructure of data centers. It provides examples like Facebook's petabyte-scale data warehouse and Google's globally distributed database Spanner. The stack aims to enable the processing and analysis of massive datasets across clusters of servers and data centers.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce).
A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
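The split → map → sort → reduce flow described above can be simulated in a few lines of plain Python. This is a toy sketch, not the Hadoop API; the weather records and the max-temperature-per-city task are invented for illustration (on a real cluster, each chunk would be a file block processed by a separate map task).

```python
from itertools import groupby
from operator import itemgetter

# Input "file", split into independent chunks (as HDFS splits files into blocks).
records = ["london,12", "cairo,30", "london,9", "cairo,35", "oslo,-2", "oslo,4"]
chunks = [records[0:2], records[2:4], records[4:6]]

def map_fn(record):
    """Map: parse one record into a (key, value) pair."""
    city, temp = record.split(",")
    return (city, int(temp))

# Each chunk is processed independently (in parallel on a real cluster).
intermediate = [map_fn(r) for chunk in chunks for r in chunk]

# The framework sorts map output by key before handing it to the reducers.
intermediate.sort(key=itemgetter(0))

def reduce_fn(key, values):
    """Reduce: fold all values for one key, here taking the maximum."""
    return (key, max(values))

results = dict(
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(intermediate, key=itemgetter(0))
)
print(results)  # {'cairo': 35, 'london': 12, 'oslo': 4}
```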
Disclaimer:
The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners. Data and images were collected from various sources on the Internet. The intention is to present the big picture of Big Data & Hadoop.
4. Example*: Facebook
• 2.5B – content items shared
• 2.7B – ‘Likes’
• 300M – photos uploaded
• 105TB – data scanned every 30 minutes
• 500+TB – new data ingested
• 100+PB – data warehouse
* VP Engineering, Jay Parikh – 2012
5. Example: Facebook’s Haystack*
• 65B photos
– 4 images of different size stored for each photo
– For a total of 260B images and 20PB of storage
• 1B new photos uploaded each week
– Increment of 60TB
• At peak traffic, 1M images served per second
• An image request is like finding a needle in a haystack
* Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.
6. More Examples
• The LHC at CERN generates 22PB of data annually (after throwing away around 99% of readings)
• The Square Kilometre Array (under construction) is expected to generate hundreds of PB each day
• Farecast, a part of Bing, searches through 225B flight and price records to advise customers on their ticket purchases
7. More Examples (2)
• The amount of annual traffic flowing over the Internet is around 700EB
• Walmart handles in excess of 1M transactions every hour (25PB in total)
• 400M Tweets every day
8. Big Data
• Large datasets whose processing and storage requirements exceed all traditional paradigms and infrastructure
– On the order of terabytes and beyond
• Generated by web 2.0 applications, sensor networks, scientific applications, financial applications, etc.
• Radically different tools needed to record, store, process, and visualize
• Moving away from the desktop
• Offloaded to the “cloud”
• Poses challenges for computation, storage, and infrastructure
9. The Stack
• Presentation layer
• Application layer: processing + storage
• Operating System layer
• Virtualization layer (optional)
• Network layer (intra- and inter-data center)
• Physical infrastructure layer
Can roughly be called the “cloud”
10. Presentation Layer
• Acts as the user-facing end of the entire ecosystem
• Forwards user queries to the backend (potentially the rest of the stack)
• Can be both local and remote
• For most web 2.0 applications, the presentation layer is a web portal
11. Presentation Layer (2)
• For instance, the Google search website is a presentation layer
– Takes user queries
– Forwards them to a scatter-gather application
– Presents the results to the user (within a time bound)
• Made up of many technologies, such as HTTP, HTML, AJAX, etc.
• Can also be a visualization library
12. Application Layer
• Serves as the back-end
• Either computes a result for the user, or fetches a previously computed result or content from storage
• The execution is predominantly distributed
• The computation itself might entail cross-disciplinary (across sciences) technology
13. Processing
• Can be a custom solution, such as a scatter-gather application
• Might also be an existing data intensive computation framework, such as MapReduce, Spark, MPI, etc., or a stream processing system, such as IBM Infosphere Streams, Storm, S4, etc.
• Analytics engines: R, Matlab, etc.
14. Numbers Everyone Should Know*

Operation                          | Time (ns)   | Scaled (1 ns → 1 s)
L1 cache reference                 | 0.5         | 0.5s
Branch mispredict                  | 5           | 5s
L2 cache reference                 | 7           | 7s
Mutex lock/unlock                  | 25          | 25s
Main memory reference              | 100         | 1m40s
Send 2K over 1Gbps network         | 20,000      | 5h30m
Read 1MB sequentially from memory  | 250,000     | ~3days
Disk seek                          | 10,000,000  | ~6days
Read 1MB sequentially from disk    | 20,000,000  | 8months
Send packet CA -> NL -> CA         | 150,000,000 | 4.75years

* Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.
15. Ubiquitous Computation: Machine Learning
• Making predictions based on existing data
• Classifying emails into spam and non-spam
• American Express analyzes the monthly expenditures of its cardholders to suggest products to them
• Facebook uses it to figure out the order of Newsfeed stories, friend and page recommendations, etc.
• Amazon uses it to make product recommendations while Netflix employs it for movie recommendations
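The spam-classification bullet above is the textbook case for a Naive Bayes model: count how often each word appears in each class, then pick the class that makes a new message most probable. A minimal pure-Python sketch, with an invented four-message training corpus and Laplace smoothing (real filters use far richer features and much more data):

```python
import math
from collections import Counter

# Tiny invented training corpus: (text, label) pairs.
train = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

# Per-class word counts and class priors.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counter in word_counts.values() for w in counter}

def predict(text):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace smoothing: unseen words must not zero out a class.
            count = word_counts[label][word] + 1
            score += math.log(count / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("win a cash prize"))  # spam
print(predict("agenda for lunch"))  # ham
```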
16. Case Study: MapReduce
• Designed by Google to process large amounts of data
– Google’s “hammer for 80% of their data crunching”
– Original paper has 9000+ citations
• The user only needs to write two functions
• The framework abstracts away work distribution, network connectivity, data movement, and synchronization
• Can seamlessly scale to hundreds of thousands of machines
• Open-source version, Hadoop, being used by everyone, from Yahoo and Facebook to LinkedIn and The New York Times
17. Case Study: MapReduce (2)
• Used for embarrassingly parallel applications, most divide-and-conquer algorithms
• For instance, the count of each word in a billion document library can be calculated in less than 10 lines of custom code
• Data is stored on a distributed filesystem
• map() -> groupBy -> reduce()
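The word-count job really does fit in a few lines when written as map() -> groupBy -> reduce(). A plain-Python sketch of the user's side of the job (the three sample documents are invented; on Hadoop the sort/group step is performed by the framework, not by the user):

```python
from itertools import groupby

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# map(): emit a (word, 1) pair for every word in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]
# groupBy: the framework sorts the intermediate pairs by key.
pairs.sort()
# reduce(): sum the counts for each distinct word.
counts = {word: sum(n for _, n in group)
          for word, group in groupby(pairs, key=lambda p: p[0])}
print(counts["the"], counts["fox"])  # 3 2
```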
18. Case Study: Storm
• Used to analyze “data in motion”
– Originally designed at Backtype but later acquired by Twitter; now an Apache project
• Each datapoint, called a tuple, passes through a processing pipeline:
Source (spout) → Operator(s) (bolt) → Sink
• The user only needs to provide the code for each operator and a graph specification (topology)
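The spout → bolt → sink pipeline can be mimicked with Python generators. This is a toy model of a topology, not the Storm API; the fake web-server log lines and the running-count bolt are invented for illustration:

```python
def spout():
    """Source: emit a stream of tuples (here, fake log lines)."""
    for line in ["GET /home", "GET /cart", "POST /cart", "GET /home"]:
        yield line

def parse_bolt(stream):
    """Bolt: transform each incoming tuple into (method, path)."""
    for line in stream:
        method, path = line.split()
        yield (method, path)

def count_bolt(stream):
    """Bolt: keep a running count per path, emitting each update downstream."""
    counts = {}
    for method, path in stream:
        counts[path] = counts.get(path, 0) + 1
        yield (path, counts[path])

# Sink: wire the topology together and consume the results.
results = list(count_bolt(parse_bolt(spout())))
print(results)  # [('/home', 1), ('/cart', 1), ('/cart', 2), ('/home', 2)]
```

Because generators are pulled one tuple at a time, each datapoint flows through the whole pipeline as it arrives, which is the essence of processing "data in motion" rather than data at rest.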
19. Storage
• Most Big Data solutions revolve around data without any structure (possibly from heterogeneous sources)
• The scale of the data makes a cleaning phase next to impossible
• Therefore, storage solutions need to explicitly support unstructured and semi-structured data
• Traditional RDBMS being replaced by NoSQL and NewSQL solutions
– Varying from document stores to key-value stores
20. Storage (2)
1. Relational database management systems (RDBMS): IBM DB2, MySQL, Oracle DB, etc. (structured data)
2. NoSQL: Key-value stores, document stores, graphs, tables, etc. (semi-structured and unstructured data)
– Document stores: MongoDB, CouchDB, etc.
– Graphs: FlockDB, etc.
– Key-value stores: Dynamo, Cassandra, Voldemort, etc.
– Tables: BigTable, HBase, etc.
3. NewSQL: The best of both worlds: Spanner, VoltDB, etc.
21. NoSQL
• Different Semantics:
– RDBMS provide ACID semantics:
• Atomicity: The entire transaction either succeeds or fails
• Consistency: Data within the database remains consistent after each transaction
• Isolation: Transactions are sandboxed from each other
• Durability: Transactions are persistent across failures and restarts
– Overkill in case of most user-facing applications
– Most applications are more interested in availability and are willing to sacrifice consistency, leading to eventual consistency
• High Throughput: Most NoSQL databases sacrifice consistency for availability, leading to higher throughput (in some cases an order of magnitude)
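Eventual consistency can be seen concretely in a toy two-replica store where writes are acknowledged after reaching one replica and propagate to the other in the background. This is a deliberately simplified, hypothetical model (real systems add quorums, conflict resolution, vector clocks, and so on):

```python
class EventuallyConsistentStore:
    """Two replicas; writes land on one and are synced to the other later."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []  # (replica_index, key, value) not yet propagated

    def write(self, key, value):
        # Acknowledge as soon as one replica has the write: fast and available.
        self.replicas[0][key] = value
        self.pending.append((1, key, value))

    def read(self, replica):
        return dict(self.replicas[replica])

    def anti_entropy(self):
        # Background sync drains the propagation queue.
        for idx, key, value in self.pending:
            self.replicas[idx][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("likes:post42", 1)
stale = store.read(1)
print(stale)   # {}  (the lagging replica has not seen the write yet)
store.anti_entropy()
fresh = store.read(1)
print(fresh)   # {'likes:post42': 1}  (the replicas have converged)
```

The window between the stale read and the converged read is exactly the consistency that is traded away for availability and throughput.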
22. Case Study: BigTable*
• Distributed multi-dimensional table
• Indexed by both row-key as well as column-key
• Rows are maintained in lexicographic order and are dynamically partitioned into tablets
• Implemented atop GFS
• Multiple tablet servers and a single master
* Fay Chang, et al. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 205-218.
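The data model above can be approximated as a sorted map keyed by (row key, column key). A minimal single-machine sketch (the reversed-domain row keys follow the example in the Bigtable paper, but the class and its methods are invented here; the real system adds timestamps, column families, tablet splitting, and GFS persistence):

```python
from bisect import insort

class TinyBigtable:
    """A sparse, sorted, multi-dimensional map: (row key, column key) -> value."""

    def __init__(self):
        self.cells = {}      # (row_key, column_key) -> value
        self.row_keys = []   # maintained in lexicographic order, as tablets are

    def put(self, row, column, value):
        if row not in self.row_keys:
            insort(self.row_keys, row)  # keep rows lexicographically sorted
        self.cells[(row, column)] = value

    def get(self, row, column):
        return self.cells.get((row, column))

    def scan(self, start_row, end_row):
        """Row-range scans are cheap because rows are stored in order."""
        return [r for r in self.row_keys if start_row <= r < end_row]

# Reversed-domain row keys keep pages of one site adjacent in the key space.
t = TinyBigtable()
t.put("com.example/about", "anchor:home", "About us")
t.put("com.example/index", "contents", "<html>...</html>")
t.put("org.example/index", "contents", "<html>...</html>")
print(t.get("com.example/index", "contents"))  # <html>...</html>
rows = t.scan("com.example", "org")
print(rows)  # ['com.example/about', 'com.example/index']
```

Lexicographic ordering is what makes tablet partitioning work: any contiguous key range (one site's pages, one user's data) lands on few tablets and can be scanned without touching the rest of the table.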
23. Case Study: Spanner*
• A database that stretches across the globe, seamlessly operating across hundreds of datacenters, millions of machines, and trillions of rows of information
• Took Google four and a half years to design and develop
• Time is of the essence in distributed systems; (possibly geo-distributed) machines, applications, processes, and threads need to be synchronized
* James C. Corbett, et al. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA, 251-264.
24. Case Study: Spanner (2)
• Spanner includes a “TrueTime API”, which makes use of atomic clocks and GPS!
• Ensures consistency for the entire system
• Even if two commits (with agreed upon ordering) take place at opposite ends of the globe (say US and China), their ordering will be preserved
• For instance, the Google ad system (an online auction where ordering matters) can span the entire globe
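The TrueTime idea can be sketched in miniature: instead of a single timestamp, the clock API returns an interval [earliest, latest] guaranteed to contain true time, and a commit waits out that uncertainty so its timestamp is definitely in the past before the write becomes visible. This is a toy single-machine model with an invented uncertainty bound, not Google's implementation:

```python
import time

EPSILON = 0.007  # assumed clock uncertainty bound (~7 ms, made up for the sketch)

def true_time_now():
    """Return an interval [earliest, latest] guaranteed to contain true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(write):
    """Commit wait: pick a timestamp, then wait until it is surely in the past."""
    _, latest = true_time_now()
    timestamp = latest                     # assigned commit timestamp
    while true_time_now()[0] <= timestamp:
        time.sleep(0.001)                  # wait out the uncertainty window
    return write, timestamp

# Two commits, even if issued on machines far apart, get comparable timestamps:
w1 = commit("ad auction bid A")
w2 = commit("ad auction bid B")
print(w1[1] < w2[1])  # True: commit order is preserved by the timestamps
```

The price of the guarantee is the commit wait itself, which is why Spanner invests in atomic clocks and GPS: the tighter the uncertainty bound, the shorter every commit has to wait.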
26. Cluster Managers
• Mix different programming paradigms
– For instance, batch-processing with stream-processing
• Cluster consolidation
– No need to manually partition cluster across multiple frameworks
• Data sharing
– Pass data from, say, MapReduce to Storm and vice versa
• Higher level job orchestration
– The ability to have a graph of heterogeneous job types
• Examples include YARN, Mesos, and Google’s Omega
27. Operating System Layer
• Consists of the traditional operating system stack with the usual suspects: Windows, variants of *nix, etc.
• Alternatives specialized for the cloud or multicore systems exist, though
• Exokernels, multikernels, and unikernels
28. Virtualization Layer
• Allows multiple operating systems to run on top of the same physical hardware
• Enables infrastructure sharing, isolation, and optimized utilization
• Different allocation strategies possible
• Easier to dedicate CPU and memory but not the network
• Allocation either in the form of VMs or containers
• VMware, Xen, LXC, etc.
29. Network Layer
• Connects the entire ecosystem together
• Consists of the entire protocol stack
• Tenants assigned to Virtual LANs
• Multiple protocols available across the stack
• Most datacenters employ traditional Ethernet as the L2 fabric, although optical, wireless, and Infiniband are not far-fetched
• Software Defined Networks have also enabled more informed traffic engineering
• Run-of-the-mill tree topologies being replaced by radical recursive and random topologies
30. Physical Infrastructure Layer
• The physical hardware itself
• Servers and network elements
• Mechanisms for power distribution, wiring, and cooling
• Servers are connected in various topologies using different interconnects
• Dubbed datacenters
• Modular and self-contained, container-sized datacenters can be moved at will
• “We must treat the datacenter itself as one massive warehouse-scale computer” – Luiz André Barroso and Urs Hölzle, Google*
* Urs Hoelzle and Luiz Andre Barroso. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (1st ed.). Morgan and Claypool Publishers.
31. Power Generation
• According to the New York Times in 2012, datacenters are collectively responsible for the energy equivalent of 7-10 nuclear power plants running at full capacity
• Datacenters have started using renewable energy sources, such as solar and wind power
• Engendering the paradigm of “move computation wherever renewable sources exist”
32. Heat Dissipation
• The scale of the setup necessitates radical cooling mechanisms
• Facebook in Prineville, US, “the Tibet of North America”
– Rather than use inefficient water chillers, the datacenter pulls the outside air into the facility and uses it to cool down the servers
• Google in Hamina, Finland, on the banks of the Baltic Sea
– The cooling mechanism pulls sea water through an underground tunnel and uses it to cool down the servers
34. Case Study: Google
• All that infrastructure enables Google to:
– Index 20B web pages a day
– Handle in excess of 3B search queries daily
– Provide email storage to 425M Gmail users
– Serve 3B YouTube videos a day