Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1isP0Rk.
Eva Andreasson presents typical categories of problems that are commonly solved using Hadoop and also some concrete examples in each category. Filmed at qconlondon.com.
Eva Andreasson has been working with Java virtual machine technologies, SOA, Cloud, and other enterprise middleware solutions for the past 10 years. She joined Cloudera in 2012. She recently launched Cloudera Search, a new real-time search workload natively integrated with the Hadoop stack, and drove the efforts that led to 10x user uptake of Hue, the UI of the Hadoop ecosystem.
1. What Can Hadoop Do for You?
@EvaAndreasson | Sr. Product Manager, Cloudera
2014
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-usage-examples
3. Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. Agenda
• Today’s data challenges
• Apache Hadoop and its ecosystem 101
• Common use cases
• Where to learn more
• Q&A
6. Key Challenge #1: Volume
• “Return On Bytes”
• How to cost-efficiently query, manage, and store TBs (or even PBs) of data?
• Premature data death
• Off-disk and archived data difficult and costly to access
• Forced data silos
• Costly copies and moves of data
• Organizational blind spots
7. Key Challenge #2: Velocity
• Enough time to process raw data before you need it
• Data ingest from sensors, cameras, feeds, streaming, logs, user interactions…
• Raw data structuring for various ETL and DB models
8. Key Challenge #3: Variety
• Costly adaptation to new data types
• Saving account info, images, videos, URL clicks, logs, and transactional data – together?
• Inflexible data models
• Major surgery for future queries
• Most data is modeled for questions we know will be asked…
• Raw data value loss
10. What is Hadoop?
A software framework that lets you find and process data directly where it is stored and where you only need to apply structure at query time
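To make "apply structure at query time" concrete, here is a minimal pure-Python sketch. It does not use any Hadoop API; the log format, the field names, and the `parse_line` and `query` helpers are hypothetical and chosen only for illustration. The raw lines are stored untouched, and a schema is imposed only when a question is asked.

```python
# Raw records are kept exactly as they arrived; no schema is enforced at write time.
# The line format below is an assumption made for this example.
RAW_LINES = [
    "2014-03-05 host-a disk_latency_ms=12",
    "2014-03-05 host-b disk_latency_ms=48",
    "2014-03-06 host-a disk_latency_ms=15",
]


def parse_line(line):
    """Apply a (hypothetical) schema to one raw line at read time."""
    date, host, metric = line.split()
    name, value = metric.split("=")
    return {"date": date, "host": host, name: int(value)}


def query(lines, predicate):
    """Parse each raw line on demand and keep records matching the predicate."""
    return [rec for rec in map(parse_line, lines) if predicate(rec)]


# The question ("which readings were slow?") is decided only now, at query time.
slow = query(RAW_LINES, lambda rec: rec["disk_latency_ms"] > 20)
print(slow)
```

A different question tomorrow can use a different `parse_line` over the same stored lines, which is the point of deferring structure.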
21. The Hadoop Ecosystem – Explained!
• Distributed File System
• Resource and Workload Scheduler
• Stream / event-based data ingest
• Batch Processing
• Interactive SQL
• Key-Value Store
• SQL
• Proc. Oriented Query
• Machine Learning
• Free-Text Search
• Real-Time Processing
• Process Mgmt
• Workflow Mgmt
• GUI
• Access Control
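The "Batch Processing" box is commonly filled by MapReduce in Hadoop; the text here does not name the component, so treat that as an assumption. The sketch below imitates the map, shuffle, and reduce steps in plain Python on one machine. It is not the Hadoop API, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["disk error on host a", "disk ok on host b"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on the nodes that hold the data; this sketch only shows the data flow.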
24. #1: Scale Data Processing at Low Cost
• Do what I usually do, but on a larger set of data
• Do my complex queries, but within a reasonable time
25. #2: Create an Active Archive
• Eliminate premature data death
• Data as a (back-chargeable) service across the organization
26. #3: Break Silos and Ask Bigger Questions
• What new insights can we achieve by combining siloed data sets?
• What else can we find by asking questions over new types of data?
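As a toy illustration of combining siloed data sets, the sketch below joins two small in-memory collections on a shared key. The data sets, field names, and the `join_on_customer` helper are hypothetical; on a cluster the same join would typically be expressed as a query over data stored in one place.

```python
# Two hypothetical silos that happen to share a customer id.
web_clicks = [
    {"customer_id": 1, "page": "/laptops"},
    {"customer_id": 2, "page": "/phones"},
    {"customer_id": 1, "page": "/laptops/bags"},
]
store_sales = [
    {"customer_id": 1, "item": "laptop", "amount": 900},
    {"customer_id": 3, "item": "tablet", "amount": 300},
]


def join_on_customer(clicks, sales):
    """Return pages viewed and total spend for customers present in both silos."""
    spend = {}
    for sale in sales:
        spend[sale["customer_id"]] = spend.get(sale["customer_id"], 0) + sale["amount"]

    combined = {}
    for click in clicks:
        cid = click["customer_id"]
        if cid in spend:
            row = combined.setdefault(cid, {"pages": [], "spent": spend[cid]})
            row["pages"].append(click["page"])
    return combined


combined = join_on_customer(web_clicks, store_sales)
print(combined)
```

Neither silo alone can answer "what did buyers look at online before purchasing in store?"; the joined view can.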
27. There are a Lot of Use Cases…
• Predictive analysis (event prediction)
• Anomaly detection
• Customer profiling
• Recommendation engines
• Clickstream analysis
• Image processing
• Product and process improvements (feedback loops)
• Genome sequence processing
• Object matching
• Path optimization
• …
28. Do the Same – and More – at a Lower Cost
• Network and storage solution company
• Create proactive support
• 600,000 “phone home” logs/week
• SLA: 40% done within 18 hours
• Complex queries taking weeks or not even possible to run
• Expected future data growth of ~7TB a month!
• With Hadoop et al and Cloudera’s help
• Future-proof scale
• Faster and more flexible analytic capabilities
• Correlation of disk latency with manufacturer (24 billion records)
• 64x query performance improvement (from weeks to hours)
• Pattern matching queries in the same infrastructure detecting bugs (240 billion records)
• TCO freed up budget for other customer-focused projects
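The figures on this slide can be sanity-checked with a little arithmetic. In the sketch below, the ~7 TB/month growth and the 64x speedup come from the slide; the two-week baseline query time is an assumption picked only to illustrate how "weeks" becomes "hours".

```python
GROWTH_TB_PER_MONTH = 7  # approximate figure from the slide
SPEEDUP = 64             # query performance improvement from the slide


def projected_growth_tb(months):
    """Data added over the given number of months at a constant growth rate."""
    return GROWTH_TB_PER_MONTH * months


def hours_after_speedup(baseline_weeks):
    """Query time in hours once a baseline given in weeks is sped up 64x."""
    baseline_hours = baseline_weeks * 7 * 24
    return baseline_hours / SPEEDUP


print(projected_growth_tb(12))  # TB added in one year
print(hours_after_speedup(2))   # assumed two-week query, in hours after speedup
```

With these inputs the yearly growth comes to 84 TB, and a hypothetical two-week query drops to 5.25 hours.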
29. Increasing Revenue Through an Active Archive
• Global on-line retailer
• Need to correlate online/offline data
• 10+ years of history records, 1,000s of product categories
• Large parts of data archived, complex to access
• With Hadoop et al and Cloudera’s help
• Unified long-term storage and processing over all data
• Machine learning and query without data moves
• Correlate all customer, product, and sales data ad hoc
• Led to targeted marketing and increased revenue streams
31. To Learn More…
1. Read some good stuff
• Order the Hadoop Operations book (http://shop.oreilly.com/product/0636920025085.do) and/or the Definitive Guide to Hadoop (http://shop.oreilly.com/product/0636920021773.do)
• Visit Cloudera’s blog: blog.cloudera.com/
2. Play on your own
• Cloudera QuickStart VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• View the videos at gethue.com
3. Get help and training
• Join or send an email to: cdh-user@cloudera.org
• Visit the Cloudera dev center: cloudera.com/content/dev-center/en/home.html
• Get training: university.cloudera.com
4. Contact Cloudera
• eva@cloudera.com
• On-line contact form: http://cloudera.com/content/cloudera/en/about/contact-us/contact-form.html