Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli... (Cloudera, Inc.)
Many people refer to Apache Hadoop as their system of choice for big data management, but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system with HDFS storage at its core. The Apache Hadoop based "big data stack" has changed dramatically over the past 24 months and will change even more over the next 24 months. This talk covers trends in the evolution of the Hadoop stack, changes in architecture, and changes in the kinds of use cases that are supported. It will also cover the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.
Hadoop World 2011: Mike Olson Keynote Presentation (Cloudera, Inc.)
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale (Data Con LA)
Anyone who has used Hadoop knows that jobs sometimes get stuck. Hadoop is powerful, and it’s experiencing a tremendous rate of innovation, but it also has many rough edges. As Hadoop practitioners we all spend a lot of effort dealing with these rough edges in order to keep Hadoop and Hadoop jobs running well for our customers and organizations. In this session, we will look at a typical problem encountered by a Hadoop user and discuss its implications for the future of Hadoop development. We will also go through the solution to this kind of problem using step-by-step instructions and the specific code we used to identify the issue. As a community, we need to work together to improve this kind of experience for our industry. Now that Hadoop 2 has shipped, we believe the Hadoop community will be able to focus its energies on rounding off rough edges like these, and this session should provide advanced users with tools and strategies to identify issues with jobs and keep them running smoothly.
This presentation will give you information about:
1. What is Hadoop?
2. History of Hadoop
3. Building Blocks – Hadoop Eco-System
4. Who is behind Hadoop?
5. What Hadoop is good for and why it is good
HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is polyglot persistence: basically, you pick the right tool for the job. In the Hadoop ecosystem, you have many tools that might be used for data processing - you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up that HBase table as a data source. As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager / data architect, I want the ability to share pieces of information across the board, and move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
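To make this concrete, below is a minimal, hedged sketch of a MapReduce job reading an HCatalog-managed table without knowing how that table is stored. The database and table names ("default", "web_logs") are hypothetical, and HCatalog package names have varied across releases (org.apache.hcatalog vs. org.apache.hive.hcatalog), so treat this as an illustration of the pattern rather than a definitive recipe:

```java
// Sketch: a map-only MapReduce job that reads a Hive/HCatalog table by name,
// letting HCatalog resolve where and how the data is actually stored.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ReadHCatTable {
  public static class FirstColumnMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord record, Context ctx)
        throws IOException, InterruptedException {
      // HCatalog hands the mapper schema-aware records, regardless of the
      // table's underlying storage format (text, RCFile, etc.).
      ctx.write(new Text(record.get(0).toString()), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read hcatalog table");
    job.setJarByClass(ReadHCatTable.class);
    // "web_logs" in database "default" is a hypothetical table name.
    HCatInputFormat.setInput(job, "default", "web_logs");
    job.setInputFormatClass(HCatInputFormat.class);
    job.setMapperClass(FirstColumnMapper.class);
    job.setNumReduceTasks(0); // map-only, for brevity
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The point of the sketch is the division of labor HCatalog promises: the job asks for a table by name, and HCatalog resolves the storage details on its behalf.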
Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And don't overlook the charming yellow elephant you see, which is named after Doug's son's toy elephant!
The topics covered in the presentation are:
1. Big Data Learning Path
2. Big Data Introduction
3. Hadoop and its Eco-system
4. Hadoop Architecture
5. Next steps: how to set up Hadoop
Big Data Cloud Meetup - Jan 24 2013 - Zettaset (BigDataCloud)
Security is the greatest challenge for the widespread adoption of Hadoop in enterprises.
This meetup will discuss how such challenges are being met with various solutions and products in the industry today. Industry security experts will showcase their varied experiences.
Keylabs' Hadoop training covers both Hadoop administration and Hadoop development. We provide the best Hadoop classroom and online training in Hyderabad and Bangalore.
http://www.keylabstraining.com/hadoop-online-training-hyderabad-bangalore
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...) - asimkadav
Machine learning methods, such as SVM and neural networks, often improve their accuracy by using models with more parameters trained on large numbers of examples. Building such models on a single machine is often impractical because of the large amount of computation required.
We introduce MALT, a machine learning library that integrates with existing machine learning software and provides data parallel machine learning. MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations. Through its general-purpose API, MALT can be used to provide data-parallelism to existing ML applications written in C++ and Lua and based on SVM, matrix factorization and neural networks. In our results, we show MALT provides fault tolerance, network efficiency and speedup to these applications.
My talk at August's joint meeting of Chicago's R and Hadoop user groups, providing an introduction to using R with Hadoop. It starts with a quick introduction to and overview of available options, then focuses on using RHadoop's rmr library to perform an analysis on the publicly available 'airline' data set.
Big Data and New Challenges for DBAs (Michael Naumov, LivePerson)
Hadoop has become a popular platform for managing large datasets of structured and unstructured data. It does not replace existing infrastructures, but instead augments them. Most companies will still use relational databases for transactional processing and low-latency queries, but can benefit from Hadoop for reporting, machine learning or ETL. This session will cover:
What is Hadoop and why do I care?
What do people do with Hadoop?
How can SQL Server DBAs add Hadoop to their architecture?
Given on a free DevelopMentor webinar. A high-level overview of big data and the need for Hadoop. Also covers Pig, Hive, YARN, and the future of Hadoop.
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big... (rhatr)
You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run MapReduce jobs and SQL-on-Hadoop queries. Something is still missing, though. After all, we are not expected to enter SQL queries while looking for information on the web; AltaVista and Google solved that for us ages ago. Why are we still requiring SQL or Java certification from our enterprise big data users? In this talk, we will look into how the integration of SolrCloud into Apache Bigtop now enables building big data indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your big data management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.
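As a flavor of what "full-text search instead of SQL" looks like from client code, here is a hedged SolrJ sketch. The Solr URL, collection name ("bigdata"), and field names are assumptions made up for illustration, and the API shown is the SolrJ 7/8-era HttpSolrClient (the talk itself concerns SolrCloud, where you would typically connect through ZooKeeper via CloudSolrClient):

```java
// Sketch: index one document into Solr, then retrieve it with a free-text query.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SearchInABox {
  public static void main(String[] args) throws Exception {
    // Point the client at a (hypothetical) collection named "bigdata".
    try (HttpSolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/bigdata").build()) {

      // Index one document; a real ingest pipeline would batch these.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("text", "petabytes of unstructured data, no SQL required");
      solr.add(doc);
      solr.commit();

      // Free-text query: no SQL or Java certification required of the end user.
      QueryResponse rsp = solr.query(new SolrQuery("text:unstructured"));
      for (SolrDocument hit : rsp.getResults()) {
        System.out.println(hit.getFieldValue("id"));
      }
    }
  }
}
```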
Members of the Chef for OpenStack community had a meetup on the last day of the Spring 2013 OpenStack Summit to coordinate and plan further Grizzly work. These are our notes, we'll report back at the Fall 2013 OpenStack Summit what we accomplished.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes hard work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell us all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details of how to best design a sturdy architecture within ODC.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also ran a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front end. I have also often seen developers implement front-end features by simply following a framework's standard rules, believing that this is enough to launch the project successfully, only for the project to fail. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
2. Hadoop, Cloudera, and eBay
Managing Petabytes with Open Source
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
January 28, 2010
3. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”
4. Presentation Outline
▪ What is Hadoop?
▪ HDFS and MapReduce
▪ Hive, Pig, Avro, Zookeeper, HBase
▪ From Steve
▪ Why Hadoop?
▪ Hadoop for machine learning and modeling
▪ Other things I find interesting
▪ What we’re building at Cloudera
▪ Questions and Discussion
5. What is Hadoop?
▪ Apache Software Foundation project, mostly written in Java
▪ Inspired by Google infrastructure
▪ Software for programming warehouse-scale computers (WSCs)
▪ Hundreds of production deployments
▪ Project structure
▪ Hadoop Distributed File System (HDFS)
▪ Hadoop MapReduce
▪ Hadoop Common
▪ Other subprojects
▪ Avro, HBase, Hive, Pig, Zookeeper
6. Anatomy of a Hadoop Cluster
▪ Commodity servers
▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
▪ Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA
▪ Inexpensive to acquire and maintain
▪ Typically arranged in a 2-level architecture
[Figure: Commodity Hardware Cluster - commodity Linux PCs, 40 nodes/rack, in a 2-level architecture]
7. HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
▪ Files are append-only via a single writer
▪ Two major daemons: NameNode and DataNode
▪ NameNode manages file system metadata
▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with cluster size (a minimal client-API sketch follows below)
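The write-once/read-many model above is visible directly in the HDFS client API. Here is a minimal Java sketch, assuming a reachable cluster whose NameNode address comes from core-site.xml on the classpath; the path /tmp/hello.txt is made up for illustration (the relevant config key is fs.defaultFS in modern Hadoop, fs.default.name in the 0.20-era releases this deck describes):

```java
// Sketch: write a file to HDFS once, then read it back.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up the NameNode address from the Hadoop config on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hello.txt"); // hypothetical path

    // Files are written once, by a single writer...
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // ...and read many times.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```

Behind these two calls, HDFS transparently handles the block splitting, replication, and checksumming the slide lists.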
9. Hadoop MapReduce
▪ Fault-tolerant execution layer and API for parallel data processing (a word-count sketch follows this slide)
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces
▪ Java
▪ C++
▪ Streaming
▪ Pig
▪ SQL (Hive)
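As a concrete illustration of the key/value model and the Java client interface, here is the canonical word-count job. It is written against the org.apache.hadoop.mapreduce API; note that Job.getInstance(...) is the Hadoop 2+ idiom, while the 0.20-era releases contemporary with this deck constructed new Job(conf, ...) directly:

```java
// Sketch: classic word count. Map emits (word, 1); reduce sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The higher-level interfaces on the slide (Pig, Hive, Streaming) ultimately produce jobs with this same map/shuffle/reduce shape.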
10. MapReduce
MapReduce pushes work out to the data
Hadoop takes advantage of HDFS's data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
[Figure: How Hadoop pushes work out to the data]
11. Hadoop Subprojects
▪ Avro
▪ Cross-language framework for RPC and serialization
▪ HBase
▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive
▪ SQL interface to structured data stored in HDFS (a JDBC query sketch follows this slide)
▪ Pig
▪ Language for data flow programming; also Owl, Zebra, SQL
▪ Zookeeper
▪ Coordination service for distributed systems
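To show what Hive's "SQL interface to structured data stored in HDFS" looks like from a client, here is a hedged JDBC sketch. The table page_views is hypothetical; the driver class and URL shown are the later HiveServer2 forms (org.apache.hive.jdbc.HiveDriver, jdbc:hive2://...), whereas Hive releases contemporary with this deck used org.apache.hadoop.hive.jdbc.HiveDriver and jdbc:hive://... instead:

```java
// Sketch: run a Hive SQL query over JDBC and print the results.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement();
         // "page_views" is a hypothetical table, used only for illustration.
         ResultSet rs = stmt.executeQuery(
             "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
      while (rs.next()) {
        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```

Under the hood, Hive compiles the query into the MapReduce jobs described on the earlier slides.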
12. Hadoop Community Support
▪ 185+ contributors to the open source code base
▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC
▪ Regular user group meetups in many cities
▪ Bay Area Meetup group has 534 members
▪ Three books (O’Reilly, Apress, Manning)
▪ Training videos free online
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Training, certification, support, and services from Cloudera
13. Hadoop Project Mechanics
▪ Trademark owned by ASF; Apache 2.0 license for code
▪ Rigorous unit, smoke, performance, and system tests
▪ Release cycle of 9 months
▪ Last major release: 0.20.0 on April 22, 2009
▪ 0.21.0 will be the last release before 1.0; nearly complete
▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines
▪ Releases made available as tarballs on Apache and mirrors
▪ Cloudera packages a distribution for many platforms
▪ RPM and Debian packages; AMI for Amazon’s EC2
14. Hadoop at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
15. Facebook Data Infrastructure
2007
[Diagram: Scribe tier and MySQL tier feeding a data collection server and an Oracle database server]
16. Facebook Data Infrastructure
2008
[Diagram: Scribe tier and MySQL tier feeding a Hadoop tier and Oracle RAC servers]
17. Major Data Team Workloads
▪ Data collection
▪ server logs
▪ application databases
▪ web crawls
▪ Thousands of multi-stage processing pipelines
▪ Summaries consumed by external users
▪ Summaries for internal reporting
▪ Ad optimization pipeline
▪ Experimentation platform pipeline
▪ Ad hoc analyses
18. Workload Statistics
Facebook 2010
▪ Largest cluster running Hive: 8,400 cores, 12.5 PB of storage
▪ 12 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7,500+ Hive jobs per day
▪ 80K compute hours per day
▪ Around 200 people per month run Hive jobs
(data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)
19. Why Did Facebook Choose Hadoop?
1. Demonstrated effectiveness for primary workload
2. Proven ability to scale past any commercial vendor
3. Easy provisioning and capacity planning with commodity nodes
4. Data access for all: engineers, business analysts, sales managers
5. Single system to manage XML/JSON, text, and relational data
6. No schemas enabled data collection without involving Data team
7. Simple, modular architecture
8. Easy to build, deploy, and monitor
9. Apache-licensed open source code granted to ASF
20. Why Did Facebook Choose Hadoop?
▪ Most importantly: the community
▪ Broad and deep commitment to future development from multiple organizations
▪ Interaction with a community often useful for recruiting
▪ Growing body of users and operators with prior expertise meant lower cost of training new users
▪ Learn about best practices from other organizations
▪ Widely available public materials for improving skills
▪ Not then, but now
▪ Commercial training, certification, support, and services
▪ Growing body of complementary software
21. Hadoop and Machine Learning/Modeling
▪ Data preparation using familiar programming tools
▪ Scalable historical storage of data for training and validation
▪ Field coding, aggregation, and data quality assertions
▪ Feature extraction over massive or complex data sets
▪ Efficient sampling and extraction to other tools
▪ Combination with other data sets
▪ Extensible metadata for organizing data sets
▪ Fundamental operations
▪ Matrix multiplication and other linear algebra
▪ Statistical tests of significance
22. Hadoop and Machine Learning/Modeling
▪ Scoring
▪ eHarmony matching users
▪ Fraud detection for billing platforms
▪ Genetic Algorithms
▪ Mailchimp’s Project Omnivore
▪ Xavier Llorà’s research
▪ Collaborative filtering
▪ Google News personalization
▪ Yahoo! front page personalization (Cokeheads)
23. Hadoop and Machine Learning/Modeling
▪ Model fitting
▪ EM algorithm and HMMs (Jimmy Lin)
▪ Graph analysis
▪ Finding largest connected component (Jeff Hodges)
▪ Social graph analysis (Jake Hofman)
▪ Document analysis
▪ Named entity extraction (Evri)
▪ Document similarity (Jimmy Lin)
▪ Image similarity: Google paper
24. Hadoop and Machine Learning/Modeling
▪ Classification
▪ Google’s PLANET for building decision trees
▪ eBay’s linear Poisson regression for behavioral targeting
▪ Sessionization of clickstream logs and path prediction
▪ Bioinformatics
▪ CloudBurst
▪ Crossbow
▪ Computer vision
▪ Face detection
▪ Face recognition
25. Hadoop and Machine Learning/Modeling
▪ Simulation
▪ Protein folding
▪ Particle-swarm optimization
▪ Crazy stuff
▪ Factoring integers
▪ Solving Boggle
▪ Generating fractals
▪ Books and conferences
▪ MDAC 2010
▪ “Data-Intensive Text Processing with MapReduce”
26. Hadoop at Cloudera
Cloudera’s Distribution for Hadoop
▪ Open source distribution of Apache Hadoop for enterprise use
▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper
▪ Ensures cross-subproject compatibility
▪ Adds backported patches and customer-specific patches
▪ Adds Cloudera utilities like MRUnit and Sqoop
▪ Better integration with daemon administration utilities
▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout
▪ Tools for automatically generating a configuration
▪ Packaged as RPM, DEB, AMI, or tarball
27. Hadoop at Cloudera
Training and Certification
▪ Free online training
▪ Basic, Intermediate (including Hive and Pig), and Advanced
▪ Includes a virtual machine with software and exercises
▪ Live training sessions
▪ One live session per month somewhere in the world
▪ If you have a large group, we may come to you
▪ Certification
▪ Exams for Developers, Administrators, and Managers
▪ Administered online or in person
28. Hadoop at Cloudera
Services and Support
▪ Professional Services
▪ Get Hadoop up and running in your environment
▪ Optimize an existing Hadoop infrastructure
▪ Design new algorithms to make the most of your data
▪ Support
▪ Unlimited questions for Cloudera’s technical team
▪ Access to our Knowledge Base
▪ Help prioritize feature development for CDH
▪ Early access to upcoming Cloudera software products
29. Hadoop at Cloudera
Commercial Software
▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis
▪ Current products
▪ Cloudera Desktop
▪ Extensible interface for users of Cloudera software
▪ Upcoming products for data collection
▪ Talk to me offline
30. Cloudera Desktop
Big Data can be Beautiful
31. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0