Four Problems You Run into When DIY-ing a “Big Data” Analytics System
(and how to solve them. Hint: Treasure Data)
Kiyoto Tamura & Jeff Yuan
Before we begin…

<announcements size=“two”>

1. we are hiring!

1. WE ARE HIRING!

We are looking for…

Lead UI/UX Designer

0

which means…

design the entire UI/UX

Anything that makes our
customers’ experience BETTER

super important
high-responsibility

Face of our service

Lead UI/UX Designer

careers@treasure-data.com

We are also looking for…

Engineers

(Hadoop) Engineers

Distributed Systems

specifically

(multi-tenant) Hadoop

Open Source!

import msgpack

class MemcacheList(object):
    def push(self, key, value):
        """ Add an element to the front of the list """
        packed = msgpack.packb(value)
        self.connection.append(key, packed)

    def _unpack(self, data):
        # b'\x90' is the msgpack encoding of an empty array
        if data == b'\x90':
            return [], 0

        _unpacker = msgpack.Unpacker()
        _unpacker.feed(data)
        # (the rest of the method is not shown on the slide)

(more on Fluentd later)

#OneMoreThing

“way better than C++!”

according to a committer

(who works at Treasure Data)

www.treasure-data.com/careers/

1. We are hiring!

2. Discounts for Our Service!

(ask us for the secret coupon code)

30% OFF

6 months

</announcements>

Four Problems You Run into When DIY-ing a “Big Data” Analytics System

Hadoop as-a-Service!

It’s a great idea

more accessible and useful

but also

not so easy to implement

e.g.

(zoom out)

Hadoop as-a-Service

good in theory, lots of work in reality

That’s where we come in!

Easiest (and most cost-effective) way
to get answers about my data!

• Collect/Store
• Query
• Access
• Scale

1. How do I collect my data and how do I
store it?

• Stream (access logs, standard error)
• Bulk (historical data, sales transactions, etc.)
• Secure and reliable storage!

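A rough sketch of the two collection paths above (stream and bulk), in Python with only the standard library. The record layout, the function names `stream_records` and `bulk_records`, and the sample log line are all assumptions made for illustration; this is not how td-agent or the Treasure Data API works internally.

```python
import csv
import io
import json
import re
from datetime import datetime

# Prefix of the Apache Common Log Format; the group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def stream_records(lines):
    """Streaming path: turn log lines into records one at a time."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not access-log entries
        fields = match.groupdict()
        when = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
        yield {
            "time": int(when.timestamp()),  # Unix time in seconds
            "host": fields["host"],
            "method": fields["method"],
            "path": fields["path"],
            "status": int(fields["status"]),
        }


def bulk_records(csv_text):
    """Bulk path: load a whole CSV export (header row required) at once."""
    return list(csv.DictReader(io.StringIO(csv_text)))


log_line = (
    '127.0.0.1 - - [10/Oct/2012:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)
streamed = list(stream_records([log_line, "not a log line"]))
bulk = bulk_records("order_id,amount\n1,9.99\n2,24.50\n")

print(json.dumps(streamed[0], sort_keys=True))
print(len(bulk), "bulk rows loaded")
```

Either way the result is a list of plain records, so what sits downstream (storage, queries) does not need to care how the data arrived.
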
[Diagram: Client (Apache, App, App + RDBMS, other data sources: csv, json) -> Server (Treasure Data API Layer)]

2. How do I query my data?

• Ad hoc queries
• Scheduled queries
• Data schema?

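To make the ad hoc vs. scheduled distinction concrete, here is a small Python sketch. The event records, `count_by`, and `scheduled_runs` are hypothetical names invented for this example, and the in-memory list stands in for real storage. The query also tolerates records with missing fields, which is one possible way of handling the schema question above, not necessarily the one Treasure Data uses.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical event records; note that fields may differ between records.
EVENTS = [
    {"day": date(2012, 8, 1), "action": "login", "user": "a"},
    {"day": date(2012, 8, 1), "action": "purchase", "user": "a", "amount": 30},
    {"day": date(2012, 8, 2), "action": "login", "user": "b"},
    {"day": date(2012, 8, 2), "action": "login"},  # no "user" field
]


def count_by(records, field):
    """The 'query': count records per value of `field`, skipping records without it."""
    return Counter(r[field] for r in records if field in r)


def scheduled_runs(records, first_day, last_day, field):
    """Run the same query once per day, the way a scheduler would."""
    results = {}
    day = first_day
    while day <= last_day:
        todays = [r for r in records if r["day"] == day]
        results[day] = count_by(todays, field)
        day += timedelta(days=1)
    return results


# Ad hoc: ask one question of all the data, right now.
ad_hoc = count_by(EVENTS, "action")

# Scheduled: the same question, asked once per day.
daily = scheduled_runs(EVENTS, date(2012, 8, 1), date(2012, 8, 2), "action")

print(ad_hoc)
print(daily[date(2012, 8, 2)])
```
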
[Diagram: query path. Labels: User; Cmdline, console; Apps (JDBC, ODBC, REST); API Layer; Hive, Pig (to be supported); Query Processing Cluster; MapReduce Jobs; Hadoop cluster; Amazon S3]

3. How do different users in my org
access query results?

• Different roles need to access results
  from different interfaces
    • Analysts -> Excel
    • Devs -> REST, MySQL

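A minimal sketch of "same result, different interface per role", in Python. The role names come from the slide; the mapping to CSV (something Excel can open) and JSON (a typical REST payload), and the names `ROLE_FORMAT` and `deliver`, are assumptions for illustration only.

```python
import csv
import io
import json

# Which output format each role gets; this mapping is an assumption.
ROLE_FORMAT = {"analyst": "csv", "dev": "json"}


def to_csv(rows):
    """Render rows (a list of dicts sharing the same keys) as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Render rows as a JSON array, as a REST endpoint might return them."""
    return json.dumps(rows)


def deliver(rows, role):
    """Pick the output format for a role; unknown roles raise KeyError."""
    renderers = {"csv": to_csv, "json": to_json}
    return renderers[ROLE_FORMAT[role]](rows)


result = [{"path": "/", "hits": 120}, {"path": "/about", "hits": 7}]
print(deliver(result, "analyst"))
print(deliver(result, "dev"))
```

The query runs once; only the last step (rendering) depends on who is asking.
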
[Diagram: Treasure Data sends results to Google Spreadsheet; ODBC -> Excel (Coming Q1); MySQL, Postgres; JDBC, REST API; POST to web server. Audiences shown: Analysts, Engineers]

4. How do I scale?

• More data?
• More queries?

Don’t worry, we’ll take care of it!

[Chart: Number of records in TD (in billions), Sep 2011 to Aug 2012; y-axis from 20 to 120]

January 2013 – Now over 200 Billion!

Treasure Data High-Level Architecture

[Diagram: sources (Log Data, Application Data, 3rd Party Data, Sensor Data, Web/Mobile Data; label: Subscribe) -> td-agent -> Treasure Data Data Warehouse -> SQL Interface (JDBC, ODBC) -> Spreadsheets, BI Tools, Operational Analytics, Databases, CLI]

Our Customers – Fortune Global 500
leaders and start-ups including:

• Japan’s #1 recipe website
• 15 million users
• 1 million recipes

MySQL to TD (Before)

MySQL to TD (After)

• Europe’s largest independent
  mobile ad exchange
• 20 billion impressions/month
• 15,000+ mobile apps

Two Weeks From Start to Finish!


Editor's Notes

  • #79 NOTE: We have to add that we cannot disclose some customers’ names here, including some of the world’s largest enterprises and one of the world’s largest web companies.