The document summarizes a presentation given by Jeff Hammerbacher on Hadoop and Cloudera. The presentation covered an introduction to Hadoop including HDFS, MapReduce and other subprojects. It discussed how Hadoop is used at large companies like Facebook and Yahoo to manage and analyze large amounts of data. It also provided an overview of Cloudera's Hadoop distribution and services to support enterprises using Hadoop.
2. Hadoop and Cloudera
Managing Petabytes with Open Source
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
December 3, 2009
Thursday, December 3, 2009
3. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist (other titles)
▪ Also, check out the book “Beautiful Data”
4. Presentation Outline
▪ What is Hadoop?
▪ HDFS
▪ MapReduce
▪ Hive, Pig, Avro, ZooKeeper, and friends
▪ Solving big data problems with Hadoop at Facebook and Yahoo!
▪ Short history of Facebook’s Data team
▪ Hadoop applications at Yahoo!, Facebook, and Cloudera
▪ Other examples: LHC, smart grid, genomes
▪ Questions and Discussion
5. What is Hadoop?
▪ Apache Software Foundation project, mostly written in Java
▪ Inspired by Google infrastructure
▪ Software for programming warehouse-scale computers (WSCs)
▪ Hundreds of production deployments
▪ Project structure
▪ Hadoop Distributed File System (HDFS)
▪ Hadoop MapReduce
▪ Hadoop Common
▪ Other subprojects
▪ Avro, HBase, Hive, Pig, ZooKeeper
6. Anatomy of a Hadoop Cluster
▪ Commodity servers
▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
▪ Typically arranged in 2 level architecture
▪ 40 nodes per rack
▪ Inexpensive to acquire and maintain
[Figure: Commodity Hardware Cluster. Typically in 2 level architecture; nodes are commodity Linux PCs; 40 nodes/rack]
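As a rough illustration of what the node spec above adds up to, the following sketch totals the disk capacity of one rack. The node and rack figures come from the slide; the replication factor of 3 is an assumption (the slide does not state one), and the function name is made up for this example.

```python
# Back-of-the-envelope capacity for one rack, using the node spec from the slide.
DISKS_PER_NODE = 4      # 4 x 1 TB SATA (from the slide)
TB_PER_DISK = 1         # from the slide
NODES_PER_RACK = 40     # from the slide
REPLICATION = 3         # assumption: not stated on the slide


def rack_capacity_tb(nodes=NODES_PER_RACK, replication=REPLICATION):
    """Return (raw_tb, usable_tb) for a rack of identical nodes."""
    raw = nodes * DISKS_PER_NODE * TB_PER_DISK
    usable = raw / replication  # every block is stored `replication` times
    return raw, usable


raw_tb, usable_tb = rack_capacity_tb()
print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: {usable_tb:.1f} TB")
```

Under these assumptions a rack holds 160 TB of raw disk, of which about a third is available for unique data once replicas are counted.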
7. HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
▪ Files are append-only via a single writer
▪ Two major daemons: NameNode and DataNode
▪ NameNode manages file system metadata
▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with node cluster size
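To make the block and replication bullets concrete, here is a minimal sketch of how a file might be cut into 128 MB blocks and how replicas could be spread over DataNodes. The replication factor of 3, the round-robin placement, and all function and node names are assumptions for illustration; real HDFS placement is rack-aware and is not reproduced here.

```python
BLOCK_SIZE_MB = 128   # block size from the slide
REPLICATION = 3       # assumption: the slide does not give a replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if remainder:
        sizes.append(remainder)  # the final block holds only what is left
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified stand-in for real placement; needs replication <= len(datanodes).
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + i) % len(datanodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(1000)  # a hypothetical 1000 MB file
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), "blocks; last block:", blocks[-1], "MB; block 0 on", layout[0])
```

In this toy model the mapping from block to nodes plays the role of the NameNode's metadata, while the nodes in each list stand in for the DataNodes that hold the bytes.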
9. Hadoop MapReduce
▪ Fault tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces
▪ Java
▪ C++
▪ Streaming
▪ Pig
▪ SQL (Hive)
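The key/value data model can be sketched with a single-process word count. This is not the Hadoop API; it only imitates the map, shuffle, and reduce phases in plain Python, and every name in it is hypothetical.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):  # keys reach the reducer in sorted order
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


lines = ["hadoop stores data", "hadoop processes data"]
word_counts = run_job(enumerate(lines), map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the map and reduce calls would run as separate tasks on many machines, with failed tasks re-executed elsewhere; the sketch shows only the data flow.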
10. MapReduce
MapReduce pushes work out to the data
Hadoop takes advantage of HDFS' data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
Figure 2: Hadoop pushes work out to the data
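One way to picture "pushing work out to the data" is a scheduler that prefers nodes already holding a replica of the block a task needs. The greedy policy below is a simplified assumption for illustration, not Hadoop's actual scheduler, and all names are hypothetical.

```python
def assign_tasks(block_locations, slots_per_node):
    """Greedily give each block's map task to a node that already stores it.

    block_locations: dict of block id -> list of nodes holding a replica
    slots_per_node:  dict of node -> number of free task slots
    Returns a dict of block id -> (node, is_local).
    """
    free = dict(slots_per_node)
    assignment = {}
    for block, replicas in block_locations.items():
        local = [node for node in replicas if free.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has a free slot: fall back to any free node,
            # which means the block has to travel over the network.
            candidates = [node for node, slots in free.items() if slots > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = candidates[0], False
        free[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


locations = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n1", "n2"]}
plan = assign_tasks(locations, {"n1": 1, "n2": 1, "n3": 1})
print(plan)
```

With three blocks and one slot per node, every task in this example lands on a node that already stores its input, so no block crosses the network.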
11. Hadoop Subprojects
▪ Avro
▪ Cross-language framework for RPC and serialization
▪ HBase
▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive
▪ SQL interface to structured data stored in HDFS
▪ Pig
▪ Language for data flow programming; also Owl, Zebra, SQL
▪ ZooKeeper
▪ Coordination service for distributed systems
12. Hadoop Community Support
▪ 185+ contributors to the open source code base
▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC
▪ Three books (O’Reilly, Apress, Manning)
▪ Training videos free online
▪ Regular user group meetups in many cities
▪ New York Meetup group has 238 members
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Commercial training, certification, and support from Cloudera
13. Hadoop Project Mechanics
▪ Trademark owned by ASF; Apache 2.0 license for code
▪ Rigorous unit, smoke, performance, and system tests
▪ Release cycle of 9 months
▪ Last major release: 0.20.0 on April 22, 2009
▪ 0.21.0 will be last release before 1.0; nearly complete
▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines
▪ Releases made available as tarballs on Apache and mirrors
▪ Cloudera packages a distribution for many platforms
▪ RPM and Debian packages; AMI for Amazon’s EC2
14. Hadoop at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
15. Facebook Data Infrastructure
2007
Scribe Tier MySQL Tier
Data Collection
Server
Oracle Database
Server
16. Facebook Data Infrastructure
2008
Scribe Tier MySQL Tier
Hadoop Tier
Oracle RAC Servers
17. Major Data Team Workloads
▪ Data collection
▪ server logs
▪ application databases
▪ web crawls
▪ Thousands of multi-stage processing pipelines
▪ Summaries consumed by external users
▪ Summaries for internal reporting
▪ Ad optimization pipeline
▪ Experimentation platform pipeline
▪ Ad hoc analyses
18. Workload Statistics
Facebook 2009
▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
▪ 4 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7,500+ Hive jobs per day
▪ 80K compute hours per day
▪ Around 200 people per month run Hive jobs
(data from Ashish Thusoo’s Hadoop World NYC presentation)
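The figures above can be combined into a few derived numbers. The sketch below assumes that "compute hours" means core-hours on the 4,800-core cluster and that 1 TB is 1,000 GB; neither is stated on the slide, so the results are only indicative.

```python
# Figures from the slide
CORES = 4_800
COMPUTE_HOURS_PER_DAY = 80_000
SCANNED_TB_PER_DAY = 135
JOBS_PER_DAY = 7_500
NEW_TB_PER_DAY = 4

# Assumption: compute hours are core-hours spent on the 4,800-core cluster.
hours_per_core = COMPUTE_HOURS_PER_DAY / CORES
utilization = hours_per_core / 24

# Assumption: decimal units, 1 TB = 1,000 GB and 1 PB = 1,000 TB.
gb_scanned_per_job = SCANNED_TB_PER_DAY * 1000 / JOBS_PER_DAY
new_pb_per_year = NEW_TB_PER_DAY * 365 / 1000

print(f"busy hours per core per day: {hours_per_core:.1f} ({utilization:.0%} of the day)")
print(f"average data scanned per job: {gb_scanned_per_job:.0f} GB")
print(f"new compressed data per year: {new_pb_per_year:.2f} PB")
```

Read this way, each core is busy for about two thirds of the day and an average job scans on the order of tens of gigabytes.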
19. Hadoop at Yahoo!
▪ Jan 2006: Hired Doug Cutting
▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
▪ Aug 2008: Deployed 4,000 node Hadoop cluster
▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
▪ Sorted 1 PB on 3,658 nodes in 16.25 hours
▪ Other data points
▪ Over 25,000 nodes running Hadoop across 17 clusters
▪ Hundreds of thousands of jobs per day from over 600 users
▪ 82 PB of data
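The sort results are easier to compare as throughput. This sketch converts each benchmark on the slide into aggregate GB/s and per-node MB/s, assuming decimal units (1 TB = 1,000 GB); the labels and function name are made up for the example.

```python
# (label, terabytes sorted, nodes, elapsed seconds), taken from the slide
BENCHMARKS = [
    ("Apr 2006", 1.9, 188, 47 * 3600),
    ("Apr 2008", 1.0, 910, 209),
    ("May 2009", 1.0, 1460, 62),
    ("May 2009 (1 PB)", 1000.0, 3658, 16.25 * 3600),
]


def throughput(terabytes, nodes, seconds):
    """Return (aggregate GB/s, per-node MB/s), assuming 1 TB = 1,000 GB."""
    aggregate_gb_s = terabytes * 1000 / seconds
    per_node_mb_s = aggregate_gb_s * 1000 / nodes
    return aggregate_gb_s, per_node_mb_s


rates = {label: throughput(tb, nodes, secs) for label, tb, nodes, secs in BENCHMARKS}
for label, (gb_s, mb_s) in rates.items():
    print(f"{label:16s} {gb_s:8.3f} GB/s aggregate  {mb_s:7.3f} MB/s per node")
```

On these numbers the 2009 terabyte sort runs at roughly 16 GB/s across the cluster, more than three times the aggregate rate of the 2008 run.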
20. Cloudera Offerings
Only One Slide, I Promise
▪ Two software products
▪ Cloudera’s Distribution for Hadoop
▪ Cloudera Desktop
▪ ...more on the way
▪ Support
▪ Professional services
21. Hadoop at Cloudera
Cloudera’s Distribution for Hadoop
▪ Open source distribution of Apache Hadoop for enterprise use
▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper
▪ Ensures cross-subproject compatibility
▪ Adds backported patches and customer-specific patches
▪ Adds Cloudera utilities like MRUnit and Sqoop
▪ Better integration with daemon administration utilities
▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout
▪ Tools for automatically generating a configuration
▪ Packaged as RPM, DEB, AMI, or tarball
22. Hadoop at Cloudera
Training and Certification
▪ Free online training
▪ Basic, Intermediate (including Hive and Pig), and Advanced
▪ Includes a virtual machine with software and exercises
▪ Live training sessions
▪ One live session per month somewhere in the world
▪ If you have a large group, we may come to you
▪ Certification
▪ Exams for Developers, Administrators, and Managers
▪ Administered online or in person
23. Hadoop at Cloudera
Services and Support
▪ Professional Services
▪ Get Hadoop up and running in your environment
▪ Optimize an existing Hadoop infrastructure
▪ Design new algorithms to make the most of your data
▪ Support
▪ Unlimited questions for Cloudera’s technical team
▪ Access to our Knowledge Base
▪ Help prioritize feature development for CDH
▪ Early access to upcoming Cloudera software products
24. Hadoop at Cloudera
Commercial Software
▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis
▪ Current products
▪ Cloudera Desktop
▪ Extensible user interface for users of Cloudera software
▪ Upcoming products
▪ Talk to me in private
25. Cloudera Desktop
Big Data can be Beautiful
26. Gemini-Specific Questions
▪ Scala
▪ ScalaNLP’s SMR, Jonhnny Weslley’s SHadoop
▪ Jeff Hodges’s Componentize
▪ Compliance/regulatory domain
▪ Security roadmap
▪ HADOOP-4487
▪ HIVE-842
▪ Real-time BI
▪ MapReduce Online Prototype (MOP)
▪ Talk to me in private
27. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.