Yahoo uses Apache Hadoop extensively to power many of its products and services. Hadoop allows Yahoo to gain insights from massive amounts of data, including user data from services like Flickr and Yahoo Mail. Yahoo has contributed over 70% of the code to the Apache Hadoop project to date. Hadoop is critical to Yahoo's business by enabling personalization, spam filtering, content optimization, and other data-driven features. Yahoo runs Hadoop on tens of thousands of servers storing over 100 petabytes of data. The company continues working to enhance Hadoop's scalability, flexibility, and performance to make it more suitable for enterprise use.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014), by Eric Baldeschwieler
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Hadoop as Data Refinery - Steve Loughran, by JAX London
Apache Hadoop is often described as a "Big Data Platform", but what does that mean? One way to better understand Hadoop is to talk about how it is used. This talk discusses using Hadoop as a "Data Refinery", which is a common use case. The concept is very much like a traditional oil refinery except with data: pulling in large quantities of "crude data" over pipelines, refining some into useful business intelligence, and refining other pieces into slightly less crude data that stays in the cluster until needed later. This metaphor proves useful when considering how Hadoop could be adopted in an organisation that already has data warehousing and business intelligence systems, and when contemplating how to hook up a Hadoop cluster to the sources of data inside and outside that organisation. A key point to remember is that storing data in Hadoop is not an end in itself, any more than storing data in a database is: the goal is extracting information from that data. Using Hadoop as a front-end "data refinery" means that it can integrate with existing Business Intelligence systems while providing the platform for new applications.
Explores the notion of "Hadoop as a Data Refinery" within an organisation, be it one with an existing Business Intelligence system or none, and looks at 'agile data' as a benefit of using Hadoop as the store for historical, unstructured and very-large-scale datasets.
The final slides look at the challenge of an organisation becoming "data driven".
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012, by Jonathan Seidman
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Big Data Warehousing: Pig vs. Hive Comparison, by Caserta
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we made a Hive vs. Pig Comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Big Data with Hadoop and HDInsight. This is an intro to the technology. If you are new to Big Data or have only just heard of it, this presentation will help you learn a little more about the technology.
Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real-world examples, learn how to achieve a competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data.
Controlling the beasts: tools for managing and monitoring distributed syst..., by yaevents
Alexander Kozlov, Cloudera Inc.
Alexander Kozlov, a senior architect at Cloudera Inc., works with large companies, many of them in the Fortune 500, on projects to build systems for analyzing large volumes of data. He completed graduate studies at the Physics Department of Moscow State University and then also earned a Ph.D. at Stanford. Before Cloudera and after finishing his studies, he worked on statistical data analysis and related computing technologies at SGI, Hewlett-Packard, and the startup Turn.
Talk topic
Controlling the beasts: tools for managing and monitoring distributed systems from Cloudera.
Abstract
Maintaining distributed systems made up of thousands of computers is a difficult task. Cloudera, a company that specializes in building distributed technologies, has developed a set of tools for centrally managing distributed Hadoop/HBase clusters. Hadoop and HBase are Apache Software Foundation projects, and their use for analyzing semi-structured data is accelerating worldwide. This talk covers SCM, a system for configuring, tuning, and managing Hadoop/HBase, and Activity Monitor, a system for monitoring a range of OS and Hadoop/HBase metrics, as well as how Cloudera's approach differs from existing monitoring solutions (Tivoli, xCat, Ganglia, Nagios, etc.).
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storage, and its analysis.
In particular, it covers the MapReduce debates and hybrid systems that combine RDBMS and MapReduce.
In addition, it explains various schema-free, non-relational data stores.
Having trouble distinguishing Big Data, Hadoop and NoSQL, or seeing how they connect? These slides from the Savvycom team can help.
Enjoy reading!
The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop Developer. It covers in depth concepts such as the Hadoop Distributed File System, setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.
Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity and agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products integrating multiple big data sets.
This presentation simplifies the concepts of Big Data, NoSQL databases, and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Enrich a 360-degree Customer View with Splunk and Apache Hadoop, by Hortonworks
What if your organization could obtain a 360-degree view of the customer across offline, online, social and mobile channels? Attend this webinar with Splunk and Hortonworks and see examples of how marketing, business and operations analysts can reach across disparate data sets in Hadoop to spot new opportunities for up-sell and cross-sell. We'll also cover examples of how to measure buyer sentiment and changes in buyer behavior, along with best practices on how to use data in Hadoop with Splunk to assign customer influence scores that online, call-center, and retail branches can use to customize more compelling products and promotions.
Webinar: Productionizing Hadoop: Lessons Learned - 20101208, by Cloudera, Inc.
Key insights in installing, configuring, and running Hadoop and Cloudera's Distribution for Hadoop in production. These are lessons learned from Cloudera helping organizations move to a production state with Hadoop.
Calpont CTO Jim Tommaney provides an overview of InfiniDB 3, Calpont’s analytic data platform.
Discussion Topics
• How InfiniDB is architected for Big Data analytics
• How InfiniDB is provisioned for Amazon EC2 with an AMI
• How to quickly create a small or large cluster
• How InfiniDB’s parallel load capabilities deliver linear load scaling
Big Data, Big Content, and Aligning Your Storage Strategy, by Hitachi Vantara
Fred Oh's presentation for SNW Spring, Monday 4/2/12, 1:00–1:45PM
Unstructured data growth is explosive and shows no signs of slowing down. Costs continue to rise along with new regulations mandating longer data retention. Moreover, disparate silos, multivendor storage assets and less than optimal use of existing assets have all contributed to ‘accidental architectures.’ And while they can be key drivers for organizations to explore incremental, innovative solutions to their data challenges, they may provide only short-term gain. Join us for this session as we outline the business benefits of a truly unified, integrated platform to manage all block, file and object data that allows enterprises to make the most out of their storage resources. We explore the benefits of an integrated approach to multiprotocol file sharing, intelligent file tiering, federated search and active archiving; how to simplify and reduce the need for backup without the risk of losing availability; and the economic benefits of an integrated architecture approach that leads to lowering TCSO by 35% or more.
Introduction to Hortonworks Data Platform for Windows, by Hortonworks
According to IDC, Windows Servers run more than 50% of the servers in the Enterprise Data Center. Hortonworks has worked closely with Microsoft to port Apache Hadoop to Windows to enable organizations to take advantage of this emerging Big Data technology. Join us in this informative webinar to hear about the new Hortonworks Data Platform for Windows.
In less than an hour, you’ll learn:
- Key capabilities available in Hortonworks Data Platform for Windows
- How HDP for Windows integrates with Microsoft tools
- Key workloads and use cases for driving Hadoop today
Introducing the Big Data Ecosystem with Caserta Concepts & Talend, by Caserta
In this one-hour webinar, Caserta Concepts and Talend described an approach to achieve an architectural framework and roadmap to extend a traditional enterprise data warehouse environment into a Big Data ecosystem.
They illustrated the architectural components involved for collecting, analyzing and delivering Big Data, with a focus on the importance of Hadoop, Data Integration, Machine Learning, NoSQL, Business Intelligence and Analytics.
Attendees learned:
Which Big Data technologies can’t be ignored
Considerations when extending the data ecosystem
What happens to your existing investment
What are the points of integration
Does Big Data = better data?
To access the recorded webinar or to learn more, visit http://www.casertaconcepts.com/.
Presentation: Overview of Kognitio, Kognitio Cloud and the Kognitio Analytical Platform
Kognitio is driving the convergence of Big Data, in-memory analytics and cloud computing. Kognitio delivered the first in-memory analytical platform in 1989, designed from the ground up to provide the highest amount of scalable compute power to allow rapid execution of complex analytical queries without the administrative overhead of manipulating data. Kognitio software runs on industry-standard x86 servers, or as an appliance, or in Kognitio Cloud, a ready-to-use analytical platform. Kognitio Cloud is a secure, private or public cloud Platform-as-a-Service (PaaS), leveraging the cloud computing model to make the Kognitio Analytical Platform available on a subscription basis. Clients span industries including market research, consumer packaged goods, retail, telecommunications, financial services, insurance, gaming, media and utilities.
To learn more, visit www.kognitio.com and follow us on Facebook, LinkedIn and Twitter.
Hadoop Reporting and Analysis - Jaspersoft, by Hortonworks
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
Combining Hadoop RDBMS for Large-Scale Big Data Analytics, by DataWorks Summit
When working with structured, semi-structured, and unstructured data, there is often a tendency to try and force one tool – either Hadoop or a traditional DBMS – to do all the work. At Vertica, we’ve found that there are reasons to use Hadoop for some analytics projects, and Vertica for others, and the magic comes in knowing when to use which tool and how these two tools can work together. Join us as we walk through some of the use cases for using Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do..., by UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024, by Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf, by Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
A tale of scale & speed: How the US Navy is enabling software delivery from l..., by sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Essentials of Automations: The Art of Triggers and Actions in FME, by Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Pushing the limits of ePRTC: 100ns holdover for 100 days, by Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Enhancing Performance with Globus and the Science DMZ, by Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales, by Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Expect practical tips and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf, by 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
hadoop @ Ibmbigdata
1. YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!
Eric Baldeschwieler
VP, Hadoop Software
2. AGENDA
• Brief Overview
• Hadoop @ Yahoo!
• Hadoop Momentum
• The Future of Hadoop
3. WHAT’S happening
- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical
Flickr : sub_lime79
4. TURNING DATA INTO INSIGHTS
machine learning
logic regression time series
content clustering
algorithms ad inventory modeling
user interest prediction
factorization models
Flickr : NASA Goddard Photo and Video
6. HADOOP: POWERING YAHOO!
science + big data + insight = personal relevance = VALUE
Flickr : DDFic
7. WHAT IS HADOOP?
Stack, top to bottom:
• Pig, Hive – Programming Languages
• MapReduce – Computation
• HDFS – Storage
Commodity
• Computers
• Network
Focus on
• Simplicity
• Redundancy
• Scale
• Availability
Transforms commodity equipment into a service that:
• HDFS – Stores petabytes of data reliably
• Map-Reduce – Allows huge distributed computations
Key Attributes
• Redundant and reliable – Doesn’t stop or lose data even as hardware fails
• Easy to program – Our rocket scientists use it directly!
• Very powerful – Allows the development of big data algorithms & tools
• Batch processing centric
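To make the Map-Reduce layer above more concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop code: `run_mapreduce`, `map_words` and `reduce_counts` are hypothetical names chosen for illustration, and the "shuffle" is an in-memory dictionary rather than a distributed sort. It only shows the shape of a job: a map function emits key/value pairs, values are grouped by key, and a reduce function folds each group.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for a single word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run a map/shuffle/reduce pass in memory (illustrative only)."""
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: each key's group is processed independently of the others,
    # which is what lets a real cluster spread the work across machines.
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["big data is here", "hadoop stores big data"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Because each key is reduced on its own, the same two functions could in principle run over petabytes split across many machines; that independence is the property the slide's "huge distributed computations" relies on.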
8. WHAT HADOOP ISN’T
• A replacement for relational and data warehouse systems
• A transactional / online / serving system
• A low latency or streaming solution
9. HADOOP IN THE ENTERPRISE
[Diagram labels: Business Intelligence Applications; HADOOP CLUSTER(S); RDMS; EDW; Data Marts; Interactions, Transactions, Structured Data; Semi-Structured or Un-Structured Data; Web Logs, Server Logs, Social Media, etc…; Business Applications]
11. HADOOP @ YAHOO!
“Where Science meets Data”
PRODUCTS
• Data Analytics
• Content Optimization
• Content Enrichment
• Yahoo! Mail Anti-Spam
• Advertising Products
• Ad Optimization
• Ad Selection
• Big Data Processing & ETL
HADOOP CLUSTERS
• Tens of thousands of servers
• 10s of Petabytes
APPLIED SCIENCE
• User Interest Prediction
• Ad inventory prediction
• Machine learning - search ranking
• Machine learning - ad targeting
• Machine learning - spam filtering
12. FROM PROJECT TO CORE PLATFORM
40K+ Servers
170 PB Storage
5M+ Monthly Jobs
“Behind every click”
[Chart, 2006 to 2010: left axis "Thousands of Servers" (0 to 90), right axis "Petabytes" (0 to 250); phases labeled Research, Science Impact, Daily Production]
13. HADOOP POWERS THE YAHOO! NETWORK
• advertising optimization
• data analytics
• machine learning
• search ranking
• advertising data systems
• Yahoo! Mail anti-spam
• audience, ad and search pipelines
• ad selection
• Yahoo! Homepage Content Optimization
• ad inventory prediction
• user interest prediction
14. CASE STUDY
YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
• Recommended links: +79% clicks vs. randomly selected
• News Interests: +160% clicks vs. one size fits all
• Top Searches: +43% clicks vs. editor selected
15. CASE STUDY
YAHOO! HOMEPAGE
• Serving Maps: Users - Interests
• Five Minute Production
• Weekly Categorization models
SCIENCE HADOOP CLUSTER
» Machine learning to build ever better categorization models
USER BEHAVIOR, CATEGORIZATION MODELS (weekly)
PRODUCTION HADOOP CLUSTER
» Identify user interests using Categorization models
SERVING MAPS (every 5 minutes), USER BEHAVIOR
SERVING SYSTEMS
Build customized home pages with latest data (thousands / second)
ENGAGED USERS
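The two loops on this slide (categorization models rebuilt weekly, serving maps of users to interests rebuilt every five minutes) can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Yahoo!'s pipeline: the function names, the `(link, category)` and `(user, link)` record shapes, and the majority-vote "model" are all hypothetical stand-ins for the machine-learned models the slide refers to.

```python
from collections import Counter, defaultdict


def train_categorization_model(labeled_clicks):
    """Weekly step (hypothetical): map each link to its most common category.

    labeled_clicks is an iterable of (link, category) pairs. A real model
    would be machine-learned; a majority vote keeps the sketch small.
    """
    votes = defaultdict(Counter)
    for link, category in labeled_clicks:
        votes[link][category] += 1
    return {link: counts.most_common(1)[0][0] for link, counts in votes.items()}


def build_serving_map(model, recent_clicks, top_n=2):
    """Five-minute step (hypothetical): map each user to top interests."""
    interests = defaultdict(Counter)
    for user, link in recent_clicks:
        category = model.get(link)
        if category is not None:  # skip links the model has never seen
            interests[user][category] += 1
    return {
        user: [category for category, _ in counts.most_common(top_n)]
        for user, counts in interests.items()
    }


# Slow loop: rebuild the model from a large history of labeled behavior.
model = train_categorization_model([
    ("a.html", "sports"), ("a.html", "sports"), ("a.html", "news"),
    ("b.html", "finance"),
])

# Fast loop: apply the current model to the latest clicks only.
serving_map = build_serving_map(model, [
    ("u1", "a.html"), ("u1", "a.html"), ("u1", "b.html"),
    ("u2", "b.html"), ("u2", "c.html"),
])
print(serving_map)
```

The design point the sketch tries to capture is the split in cadence: the expensive model build runs rarely, while the cheap lookup-and-count pass runs often, so the serving systems always read a recent map without waiting on training.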
16. CASE STUDY
YAHOO! MAIL
Enabling quick response in the spam arms race
• 450M mail boxes
• 5B+ deliveries/day
SCIENCE
• Antispam models retrained every few hours on Hadoop
PRODUCTION
40% less spam than Hotmail and 55% less spam than Gmail
17. YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date
Hadoop is not our business, but Hadoop is key to our business
• Yahoo! benefits from open source eco-system around Hadoop
• Hadoop drives revenue at Yahoo! by making our core products better
We need Hadoop to be rock solid
• We invest heavily in core Hadoop development
• We focus on scalability, reliability, availability
We fix bugs before you see them
• We run very large clusters
• We have a large QA effort
• We run a huge variety of workloads
We are good Apache Hadoop citizens
• We contribute our work to Apache
• We share the exact code we run
22. MAKING HADOOP ENTERPRISE-READY
WHAT’S NEXT
Hadoop is far from “done”
• Current implementation is showing its age
• Need to address several deficiencies in scalability, flexibility, ease of use & performance
Yahoo! is working on Next Generation of Hadoop
• MapReduce: Rewrite to improve performance; pluggable support for new programming models
• HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
Apache should remain the hub of Hadoop ecosystem
• Yahoo! contributes all Hadoop changes back to Apache Hadoop
• Everyone benefits from shared neutral foundation