1. Treasure Data
Hadoop meets Cloud with Multi-Tenancy
Kazuki Ohta
Founder and CTO at Treasure Data, Inc.
Hadoop User Group Japan (Hadoopユーザー会)
k@treasure-data.com
@kzk_mover
Friday, April 5, 13
2. Who are you?
Kazuki Ohta (太田一樹)
• @kzk_mover, k@treasure-data.com
Treasure Data, Inc.
• Chief Technology Officer, Founded July 2011
Hadoop User Group Japan
• One of the founders
• “Hadoop徹底入門” (“A Thorough Introduction to Hadoop”)
Open-Source Enthusiast
• Hadoop, memcached, jemalloc, MongoDB, uim, etc.
7. Hadoop Versions
Too Many Variations (+Eco System)
from http://marblejenka.blogspot.jp/2013/01/hadoop.html
8. Current Big Data Solutions: ‘Feature Creep’
http://en.wikipedia.org/wiki/Feature_creep
9. We need a Machete :)
EVERYTHING
with
ONE interface
Simple & Discoverable
Machete Design by James Lindenbaum
Heroku Co-Founder
http://www.youtube.com/watch?v=3BhDLm9jo5Y
10. ‘Simplicity’ itself is a feature :)
by Anand Babu Periasamy
GlusterFS Co-Founder
13. Battlefield of IaaS Vendors: SCM
In the near future, most HW buyers won't be individual companies, but cloud vendors.
Battlefield: Supply Chain Management
[Chart: HW performance / price over time, for IaaS vendors vs. on-premise; annotation: "Decrease with Moore's Law"]
14. PaaS, SaaS:
IT is all about Operation
More Sleep, More Value
With PaaS, you offload your development operations function and
have the PaaS provider handle the tools and components required to
deploy and manage applications reliably. - EngineYard
15. PaaS/SaaS Battlefield: ‘Time’ is Money
[Chart: customer value over time, starting at sign-up or PO. Curves: Ideal (expectation); Reality (On-Premise). On-premise annotations: HW/SW selection, PoC, deploy...; obsolete over time; upgrade.]
16. Introduction to Treasure Data
18. Company Overview
Silicon Valley-based Company
• All Founders are Japanese
• Hironobu Yoshikawa
• Kazuki Ohta
• Sadayuki Furuhashi
OSS Enthusiasts
• MessagePack, Fluentd, etc.
• Cloud native
19. Our 50+ Customers
Fortune Global 500 leaders and start-ups, including: [customer logos]
• 250 billion records / month (in Feb 2013)
• 2 million jobs executed
21. Investors
Bill Tai
Naren Gupta - Nexus Ventures, Director of Red Hat, TIBCO
Othman Laraki - Former VP Growth at Twitter
James Lindenbaum, Adam Wiggins, Orion Henry - Heroku Founders
Anand Babu Periasamy, Hitesh Chellaram - Gluster Founders
Yukihiro “Matz” Matsumoto - Creator of Ruby
Dan Scheinman - Director of Arista Networks
+ 10 more people
• and.... Jerry Yang, Founder of Yahoo!, where Hadoop was invented :)
Check out today's (2013/01/21) morning Nikkei (日経新聞)!
22. Treasure Data’s Philosophy and Architecture
23. Big Data Adoption Stages
Stages, from highest to lowest intelligence sophistication:
Analytics
• Optimization: What’s the best?
• Predictive Analysis: What’s a trend?
• Statistical Analysis: Why?
Reporting (Treasure Data’s FOCUS, 80% of needs)
• Alerts: Error?
• Drill Down Query: Where exactly?
• Ad-hoc Reports: Where?
• Standard Reports: What happened?
24. Full Stack Support for Big Data Reporting
• Our best-in-class architecture and operations team ensure the integrity and availability of your data.
• Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode.
• Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches.
• You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.
25. Treasure Data = Collect + Store + Query
26. Example in AdTech: MobFox
1. Europe’s largest independent mobile ad exchange.
2. 20 billion impressions/month (circa Jan. 2013)
3. Serving ads for 15,000+ mobile apps (circa Jan. 2013)
4. Needed Big Data Analytics infrastructure ASAP.
28. Our Value was Proven :)
Our Value: Save Time!
[Chart: customer value over time, starting at sign-up or PO. Curves: Simple Interface; Reality (On-Premise). On-premise annotations: HW/SW selection, PoC, deploy...; obsolete over time; upgrade.]
29. Architecture Breakdown
Data Collection
• Increasing variety of data sources
• No single data schema
• Lack of streaming data collection method
• 60% of Big Data project resource consumed
Data Store/Analytics
• Remaining complexity in both traditional DWH and Hadoop (very slow time to market)
• Challenges in scaling data volume and expanding cost.
Connectivity
• Required to ensure connectivity with existing BI/visualization/apps by JDBC, REST and ODBC.
30. 1) Data Collection
60% of BI project resources are consumed here
Most ‘underestimated’ and ‘unsexy’ but MOST important
Fluentd: OSS lightweight but robust Log Collector
• http://fluentd.org/
These talks will cover Fluentd :)
15:40∼ Log analysis system with Hadoop in livedoor 2013
by Satoshi Tagomori @ NHN Japan
16:30∼ “How to Collect Data into Hadoop” (いかにしてHadoopにデータを集めるか)
by Sadayuki Furuhashi @ Treasure Data, Inc.
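To make the streaming-collection idea concrete, here is a minimal Python sketch. It is not Fluentd's code or API: the `BufferedCollector` class, its `chunk_size` parameter and the `upload` callback are hypothetical names chosen for illustration. It only shows the general pattern of a robust collector: events shaped as (tag, time, record) are buffered and flushed in chunks, and a chunk that fails to upload stays in the buffer for a later retry.

```python
import time
from typing import Callable, Dict, List, Tuple

# An event is (tag, unix_time, record), the shape Fluentd uses for log events.
Event = Tuple[str, float, Dict[str, object]]


class BufferedCollector:
    """Hypothetical in-memory collector: buffer events, flush them in chunks."""

    def __init__(self, upload: Callable[[List[Event]], None], chunk_size: int = 3):
        self.upload = upload          # destination, e.g. a client for a cloud store
        self.chunk_size = chunk_size  # flush once this many events are buffered
        self.buffer: List[Event] = []

    def emit(self, tag: str, record: Dict[str, object]) -> None:
        """Record one event and flush when the buffer is full."""
        self.buffer.append((tag, time.time(), record))
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self) -> bool:
        """Try to upload the buffer; keep it for a later retry on failure."""
        if not self.buffer:
            return True
        chunk = list(self.buffer)
        try:
            self.upload(chunk)
        except OSError:
            return False              # events stay buffered, nothing is lost
        del self.buffer[:len(chunk)]
        return True
```

A real collector would also persist the buffer to disk and retry with backoff; this sketch leaves both out to stay short.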
31. 2) Data Store / Analytics - Columnar Storage
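As a rough illustration of why columnar storage suits reporting queries, the Python sketch below pivots schema-less records into one list per column and then aggregates a single column without touching the others. It is not Treasure Data's storage format; `to_columns` and `sum_column` are hypothetical helper names, and a real columnar store would add compression, typing and on-disk layout.

```python
from typing import Dict, List

Row = Dict[str, object]


def to_columns(rows: List[Row]) -> Dict[str, List[object]]:
    """Pivot row-oriented records into one list per column.

    Records may have different keys (no fixed schema); missing values
    are stored as None so every column has the same length.
    """
    names = sorted({key for row in rows for key in row})
    return {name: [row.get(name) for row in rows] for name in names}


def sum_column(columns: Dict[str, List[object]], name: str) -> float:
    """Aggregate one column; the other columns are never read."""
    return sum(v for v in columns.get(name, []) if isinstance(v, (int, float)))
```

For example, summing an `imps` column over ad-impression records reads only that one list, however many other fields each record carries.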
32. 3) Connectivity
[Diagram: clients (td-command, BI apps via JDBC/ODBC driver, web apps via REST API) send queries to the Query API; the Query Processing Cluster runs them against Treasure Data Columnar Storage; results can be written to MySQL or Postgres.]
33. Most Difficult Challenge: Multi-Tenancy
All customers share the Hadoop clusters (4 Data Centers)
Resource Sharing (Burst Cores), Rapid Improvement, Ease of Upgrade
[Diagram: Job Submission + Plan Change go to a Global Scheduler, which performs On-Demand Resource Allocation across a Local FairScheduler in each of datacenters A, B, C and D.]
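The resource-sharing idea (guaranteed cores per plan, with idle cores lent out as burst capacity) can be sketched in Python as follows. This is an illustrative assumption, not Treasure Data's scheduler and not the algorithm of Hadoop's FairScheduler: `allocate_cores`, `pick_datacenter` and their parameters are hypothetical names, and the "most free cores" routing policy is invented for the example.

```python
from typing import Dict


def allocate_cores(total: int, guaranteed: Dict[str, int],
                   demand: Dict[str, int]) -> Dict[str, int]:
    """Give each tenant min(demand, guarantee), then share idle cores as burst."""
    if sum(guaranteed.values()) > total:
        raise ValueError("guarantees exceed cluster capacity")
    alloc = {t: min(demand.get(t, 0), g) for t, g in guaranteed.items()}
    idle = total - sum(alloc.values())
    # Hand out idle cores one at a time, round-robin, to tenants wanting more.
    while idle > 0:
        hungry = [t for t in sorted(alloc) if alloc[t] < demand.get(t, 0)]
        if not hungry:
            break
        for t in hungry:
            if idle == 0:
                break
            alloc[t] += 1
            idle -= 1
    return alloc


def pick_datacenter(free_cores: Dict[str, int]) -> str:
    """Hypothetical global policy: route a job to the datacenter with most free cores."""
    return max(sorted(free_cores), key=lambda dc: free_cores[dc])
```

With 10 cores, guarantees of 4/4/2 and demands of 8/1/2, the first tenant bursts to 7 cores because the second tenant leaves 3 of its guaranteed cores idle. A plan change then amounts to calling `allocate_cores` again with updated guarantees.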
34. Conclusion
Big Data is too complex
• Needs Simplicity
• Machete vs. Swiss Army Knife (Feature Creep)
IT is changing
• The value of Software itself is decreasing
• Operation is the key
Treasure Data = Cloud + Big Data
• Currently Focusing on Big Data Reporting
• Instant Value with Simple Interface
35. We’re hiring top talent. Please contact me :)
37. Big Data Market Growth
[Chart: market growth (average of IDC, Gartner and Wikibon stats), CAGR 38%]
[Chart: Big Data Revenue Breakdown]
“In 2012…BI and Analytics are rated #1 priorities.” - Ravi Kalakota, Gartner
“More than half a billion dollars in venture capital has been invested in new big data technology.” - Dan Vesset, IDC
“Big Data is the new definitive source of competitive advantage across all industries.” - Jeff Kelly, Wikibon
38. Big Data Situation
[Chart: customer value over time, starting at sign-up or PO. Plotted: Treasure Data; AWS (RedShift, EMR); on-premise solutions (Software A, Software B), with obsolescence over time.]
39. Treasure Data Service Architecture
[Diagram: data from the user's Apache, apps, RDBMS and other data sources flows into the Treasure Data columnar data warehouse. MAPREDUCE JOBS: HIVE, PIG (to be supported). td-command and BI apps (JDBC, REST) send queries through the Query API to the Query Processing Cluster.]
40. Our Own Open Source technologies
We are open source natives and proud of our heritage.
We’ve contributed to Hibernate, Hadoop, Cassandra,
Memcached, KDE, MongoDB among others.
Our product reflects our deep commitment to the open-source
community and is built on top of open source software we’ve
authored and open sourced.
• Fluentd - a popular data collector daemon written in Ruby
www.fluentd.org (leading users: SlideShare/LinkedIn, One Kings Lane)
• MessagePack - a fast, compact serializer.
www.msgpack.org (leading users: Pinterest, Redis)
Substantial commitment (Code, Packaging, Documentation, Sponsorship)
Tech marketing, Possible lead gen