Agile Analytics Applications
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks

Formerly Viz, Data Science at Ning, LinkedIn

HBase Dashboards, Career Explorer, InMaps




© Hortonworks Inc. 2012
Agile Data - The Book (March, 2013)

                              Read it now on OFPS



                                 A philosophy,
                                 not the only way



But still, it's good! Really!
We go fast... but donā€™t worry!
• Examples for EVERYTHING on the Hortonworks blog:
  http://hortonworks.com/blog/authors/russell_jurney

• Download the slides - click the links - read examples!

• If it's not on the blog, it's in the book!

• Order now: http://shop.oreilly.com/product/0636920025054.do

• Read the book NOW on OFPS:
• http://ofps.oreilly.com/titles/9781449326265/chapter_2.html


Agile Application Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility

+ NoSQL
Data Warehousing




Scientific Computing / HPC
  • 'Smart kid' only: MPI, Globus, etc. until Hadoop




Tubes and Mercury (old school)      Cores and Spindles (new school)

UNIVAC and Deep Blue both fill a warehouse. We're back...
Data Science?

(pie chart: Application Development, Data Warehousing, and Scientific Computing / HPC at roughly a third each)
Data Center as Computer
• Warehouse Scale Computers and applications




"A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner."
Click here for a paper on operating a 'data center as computer.'
Hadoop to the Rescue!
             Big data refinery / Modernize ETL
(diagram: new data sources (audio/video/images; docs/text/XML; web logs and clicks; social graphs and feeds; sensors/devices/RFID; spatial/GPS; events) flow into HDFS and the Apache Hadoop "Big Data Refinery", which feeds business transactions and interactions via SQL/NoSQL/NewSQL stores, ETLs into the EDW/MPP tier, and drives Business Intelligence and Analytics: dashboards, reports, visualization)

I stole this slide from Eric. Update: He stole it from someone else.
Hadoop to the Rescue!
• Easy to use! (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!

• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoah!

• An army of mappers and reducers at your command
• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
NOW WHAT?




?
Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people

• Expert researchers: lend 25-50% of their time to teams
• Pick relevant researchers. Leave them alone. They'll spawn
  new products by accident. Not just CS/Math. Design. Art?

• Creative workers. Run like a studio, not an assembly line
• Total freedom... with goals and deliverables.

• Work environment matters most: private, social & quiet space
• Desks/cubes optional
How to get insight into product?
• Back-end has gotten t-h-i-c-k-e-r

• Generating $$$ insight can take 10-100x app dev

• Timeline disjoint: analytics vs agile app-dev/design

• How do you ship insights efficiently?

• How do you collaborate on research vs developer timeline?
The Wrong Way - Part One




"We made a great design. Your job is to predict the future for it."
The Wrong Way - Part Two




"What's taking you so long to reliably predict the future?"
The Wrong Way - Part Three




"The users don't understand what 86% true means."
The Wrong Way - Part Four




 GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!




The Wrong Way - Inevitable Conclusion




(image: a plane and a mountain)
Reminds me of... the waterfall model




:(
Chief Problem


You can't design insight in analytics applications.

You discover it.

You discover by exploring.
-> Strategy


   So make an app for exploring your data.


  Iterate and publish intermediate results.


 Which becomes a palette for what you ship.


Data Design

• It's not the 1st query that = insight; it's the 15th, or the 150th
• Capturing "Ah ha!" moments
• Slow to do those in batch...

• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate

• Opportunity lies at intersection of data & design
How do we get back to Agile?




Statement of Principles




                              (then tricks, with code)




Set up an environment where...

• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day 0

• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship

• Until the application pays for itself and more
Value document > relation




Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
Value document > relation




Note: Hive/ArrayQL/NewSQL's support of documents/array types blurs this distinction.
Relational Data = Legacy?
• Why JOIN? Storage is fundamentally cheap!

• Duplicate that JOIN data in one big record type!

• ETL once to document format on import, NOT every job

• Not zero JOINs, but far fewer JOINs

• Semi-structured documents preserve data's actual structure

• Column compressed document formats beat JOINs! (paper coming)
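The "ETL once to documents" idea can be sketched in plain Python. A hedged sketch: the field names echo the Enron example later in the deck and are purely illustrative. Fold child rows into their parent record one time at import, and downstream jobs never JOIN again.

```python
# Denormalize once at import time: fold each message's recipients into
# one nested document, so downstream jobs never need a JOIN.
messages = [
    {"message_id": "m1", "subject": "Q3 numbers"},
    {"message_id": "m2", "subject": "lunch?"},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "a@example.com"},
    {"message_id": "m1", "reciptype": "cc", "address": "b@example.com"},
    {"message_id": "m2", "reciptype": "to", "address": "c@example.com"},
]

def to_documents(messages, recipients):
    # Bucket recipients by message_id, then by type (to/cc/bcc)
    by_msg = {}
    for r in recipients:
        buckets = by_msg.setdefault(r["message_id"], {"to": [], "cc": [], "bcc": []})
        buckets[r["reciptype"]].append(r["address"])
    # Emit one self-contained document per message
    return [
        dict(m,
             tos=by_msg.get(m["message_id"], {}).get("to", []),
             ccs=by_msg.get(m["message_id"], {}).get("cc", []),
             bccs=by_msg.get(m["message_id"], {}).get("bcc", []))
        for m in messages
    ]

docs = to_documents(messages, recipients)
```

Pay the join cost once here, then ship the nested documents everywhere.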
Value imperative > declarative
• We don't know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of a data scientist's time is spent munging. See: ETL.

• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
Value dataflow > SELECT




Ex. dataflow: ETL + email sent count




(I can't read this either. Get a big version here.)
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.

But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive...
See: HCatalog for Pig/Hive integration, and this post.
Localhost vs Petabyte scale: same tools
• Simplicity essential to scalability: use the highest level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust 'databases,' only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.
Data-Value Pyramid




Climb it. Do not skip steps. See here.
0/1) Display atomic records on the web




0.0) Document-serialize events
• Protobuf
• Thrift
• JSON

• Avro - I use Avro because the schema is onboard.
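As a rough stand-in using only the standard library (the Avro bindings themselves aren't shown on this slide), document-serializing events just means writing self-describing records. JSON lines shows the shape of it; Avro instead binary-encodes the records and stores the schema once in the file header.

```python
import io
import json

events = [
    {"message_id": "m1", "from": {"address": "bob@example.com"}, "body": "hi"},
    {"message_id": "m2", "from": {"address": "sue@example.com"}, "body": "bye"},
]

# Write one JSON document per line. (Avro would binary-encode these
# records and embed the schema once in the file header instead.)
buf = io.StringIO()
for event in events:
    buf.write(json.dumps(event) + "\n")

# Read them back: each line round-trips to the original nested document.
records = [json.loads(line) for line in buf.getvalue().splitlines()]
```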
0.1) Documents via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
     message_id:chararray,
     sql_date:chararray,
     from_address:chararray,
     from_name:chararray,
     subject:chararray,
     body:chararray
);

enron_recipients = load '/enron/enron_recipients.tsv' as (message_id:chararray, reciptype:chararray, address:chararray, name:chararray);

split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';

headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage();

Example here.
0.2) Serialize events from streams
class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)

Scrape your own gmail in Python and Ruby.
0.3) ETL Logs



log_data = LOAD 'access_log'
  USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
  AS (remoteAddr,
    remoteLogname,
    user,
    time,
    method,
    uri,
    proto,
    bytes);
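Outside Pig, the same Common Log Format fields can be pulled out with a short regex. This is a sketch, not a production parser; real access logs have more edge cases.

```python
import re

# Apache Common Log Format:
#   host ident user [time] "method uri proto" status bytes
CLF = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = CLF.match(line).groupdict()
```

`record` is now a plain dict with the same field names the Pig load statement uses.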
1) Plumb atomic events -> browser




      (Example stack that enables high productivity)

Lots of Stack Options with Examples
• Pig with Voldemort, Ruby, Sinatra: example
• Pig with ElasticSearch: example
• Pig with MongoDB, Node.js: example

• Pig with Cassandra, Python Streaming, Flask: example
• Pig with HBase, JRuby, Sinatra: example
• Pig with Hive via HCatalog: example (trivial on HDP)
• Up next: Accumulo, Redis, MySQL, etc.
1.1) cat our Avro serialized events

me$ cat_avro ~/Data/enron.avro
{
    u'bccs': [],
    u'body': u'scamming people, blah blah',
    u'ccs': [],
    u'date': u'2000-08-28T01:50:00.000Z',
    u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
    u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
    u'subject': u'Re: Enron trade for frop futures',
    u'tos': [
      {u'address': u'connie@enron.com', u'name': None}
    ]
}



Get cat_avro in python, ruby
1.2) Load our events in Pig

me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address: chararray, name: chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray, name: chararray)},
  ccs: {cc: (address: chararray, name: chararray)},
  bccs: {bcc: (address: chararray, name: chararray)}
}
1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails



---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
|        |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
Upgrade to Pig 0.10+
1.4) Publish our events to a ā€˜databaseā€™
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro 
   -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig


Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, let's have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();


Full instructions here.
1.5) Check events in our ā€˜databaseā€™
$ mongo enron

MongoDB shell version: 2.0.2
connecting to: enron

> show collections
emails
system.indexes

> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}




1.6) Publish events on the web

require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end




1.6) Publish events on the web




What's the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.

• Entire team is grounded in reality!
• You'll see how ugly your data really is.
• You'll see how much work you have yet to do.
• Ship early and often!

• Feels agile, don't it? Keep it up!
1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
  <table class="table table-striped table-bordered table-condensed">
    <thead>
    {% for key in data['keys'] %}
         <th>{{ key }}</th>
    {% endfor %}
    </thead>
    <tbody>
         <tr>
         {% for value in data['values'] %}
           <td>{{ value }}</td>
         {% endfor %}
         </tr>
    </tbody>
  </table>
</div>
</body>

Complete example here with code here.
1.7) Wrap events with Bootstrap




Refine. Add links between documents.




Not the Mona Lisa, but coming along... See: here
1.8) List links to sorted events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w


emails_per_user = foreach (group emails by from.address) {
      sorted = order emails by date;
      last_1000 = limit sorted 1000;
      generate group as from_address, last_1000 as emails;
      };


store emails_per_user into '$mongourl' using MongoStorage();

Use your 'database', if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
1.8) List links to sorted documents




1.9) Make it searchable...
If you have list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();


emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');




Test it with curl:
 curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'



ElasticSearch has no security features. Take note. Isolate.

From now on we speed up...




Don't worry, it's in the book and on the blog.

                             http://hortonworks.com/blog/




2) Create Simple Charts




2) Create Simple Tables and Charts




2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.

• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
2.1) Top N (of anything) in Pig


pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
  };
store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json.

This would make a good Pig Macro.
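For comparison, here is the same group/sort/limit pattern in plain Python, with `heapq.nlargest` playing the role of `order ... desc` plus `limit` (the `key` and `rank` field names are illustrative):

```python
import heapq
from collections import defaultdict

things = [
    {"key": "a", "rank": 3}, {"key": "a", "rank": 9}, {"key": "a", "rank": 5},
    {"key": "b", "rank": 1}, {"key": "b", "rank": 7},
]

# Group by key (Pig's: group things by key)
grouped = defaultdict(list)
for t in things:
    grouped[t["key"]].append(t)

# Keep the 2 highest-ranked things per key
# (Pig's: order ... by arbitrary_rank desc; limit sorted N)
top_n = {
    key: heapq.nlargest(2, group, key=lambda t: t["rank"])
    for key, group in grouped.items()
}
```

As with the Pig version, `top_n` is the same key-to-list structure the browser would receive as JSON.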
2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
   generate flatten(group) as (key, month),
            COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
   timeseries = order things_by_month by month;
   generate group as key, timeseries as timeseries;
   };

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.
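A stdlib sketch of the same two steps: count per (key, month), then emit an ordered series per key (field names illustrative):

```python
from collections import Counter, defaultdict

events = [
    {"key": "a", "datetime": "2001-01-09T06:38:00Z"},
    {"key": "a", "datetime": "2001-01-15T10:00:00Z"},
    {"key": "a", "datetime": "2001-03-02T09:00:00Z"},
    {"key": "b", "datetime": "2001-02-01T00:00:00Z"},
]

# Count events per (key, month) -- the ISOToMonth + COUNT_STAR step.
# Slicing the ISO string to 7 chars ("2001-01") rounds to the month.
totals = Counter((e["key"], e["datetime"][:7]) for e in events)

# Re-group per key and order by month -- the time series step
timeseries = defaultdict(list)
for (key, month), total in sorted(totals.items()):
    timeseries[key].append((month, total))
```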
Data processing in our stack

A new feature in our application might begin at any layer... great!




I'm creative! I know Pig!        I'm creative too! I <3 Javascript!        omghi2u! where r my legs? send halp




    Any team member can add new features, no problemo!
Data processing in our stack

... but we shift the data-processing towards batch, as we are able.




                                                       See real example here.
                               Ex: Overall total emails calculated in each layer
3) Exploring with Reports




3) Exploring with Reports




3.0) From charts to reports...

• Extract entities from properties we aggregated by in charts (Step 2)

• Each entity type gets its own kind of web page

• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)

• Link most related entities together, same and between types.

• More visualizations!

• Parameterize results via forms.
3.1) Looks like this...




3.2) Cultivate common keyspaces




3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.

• See what's interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.

'People' could be just your team, if data is sensitive.
4) Predictions and Recommendations




4.0) Preparation
• We've already extracted entities, their properties and relationships

• Our charts show where our signal is rich

• We've cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data

• They got that understanding by exploring the data

• We are all on the same page!
4.1) Smooth sparse data




See here.
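One simple smoothing approach, assuming you have already filled missing days with zeros: a centered moving average. This is just an illustration of the idea; the linked example may use a different technique.

```python
def smooth(series, window=3):
    """Centered moving average over a list of numbers.

    At the edges the window is truncated, so the output has the
    same length as the input.
    """
    half = window // 2
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Sparse daily counts, with missing days already filled with 0
daily = [0, 4, 0, 0, 6, 0, 0]
smoothed = smooth(daily)
```

The spikes spread out into their neighbors, which reads much better in a chart than isolated bars.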
4.2) Think in different perspectives
• Networks
• Time Series
• Distributions

• Natural Language
• Probability / Bayes

See here.
4.3) Sink more time in deeper analysis
TF-IDF
import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 TF*IDF scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
  sorted = order my_tf_idf_scores by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
}



Probability / Bayes
sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate sent_counts::from as from,
                                             sent_counts::to as to,
                                             (float)reply_counts::total/(float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio;




Example with code here and macro here.
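The TF*IDF score the macro computes is simple enough to sketch in pure Python for a handful of toy documents (the Pig macro does the same arithmetic across the cluster; the toy documents here are made up):

```python
import math
from collections import Counter

docs = {
    "m1": "enron trade futures trade",
    "m2": "lunch menu",
    "m3": "enron lunch",
}

def tf_idf(docs):
    """score[doc][term] = (term freq in doc) * log(N / docs containing term)."""
    n = len(docs)
    tokenized = {doc_id: body.split() for doc_id, body in docs.items()}
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return scores

scores = tf_idf(docs)
```

Terms concentrated in one document ("trade") score high; terms spread across documents ("enron") score low.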
4.4) Add predictions to reports




5) Enable new actions




Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here.
Example: Packetpig and PacketLoop




• Amsterdam, March 20, 21st

• Call for papers now open!

• Submit a lightning talk!

• http://hadoopsummit.org/amsterdam/

• Discount coupons - 10% off!
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily

• Monitor, manage any size cluster with familiar console and tools

• Only platform to include data integration services to interact with any data

• Metadata services opens the platform for integration with existing applications

• Dependable high availability architecture

• Tested at scale to future proof your cluster growth

✓ Reduce risks and cost of adoption
✓ Lower the total cost to administer and provision
✓ Integrate with your existing ecosystem
Hortonworks Training
                            The expert source for
                            Apache Hadoop training & certification

Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team.
  – The "right" course, with the most extensive and realistic hands-on materials
  – Provide an immersive experience into real-world Hadoop scenarios
  – Public and Private courses available



Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable Apache Hadoop expert




Next Steps?

1                                 Download Hortonworks Data Platform
                                  hortonworks.com/download




2   Use the getting started guide
    hortonworks.com/get-started




3   Learn more… get support

                                                             Hortonworks Support
       ā€¢ Expert role based training                          ā€¢ Full lifecycle technical support
       ā€¢ Course for admins, developers                         across four service levels
         and operators                                       ā€¢ Delivered by Apache Hadoop
       ā€¢ Certification program                                 Experts/Committers
       ā€¢ Custom onsite options                               ā€¢ Forward-compatible

        hortonworks.com/training                             hortonworks.com/support


        © Hortonworks Inc. 2012                                                                   83
Thank You!
Questions & Answers

Slides: http://slidesha.re/O8kjaF

Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog




     © Hortonworks Inc. 2012        84

Paris HUG - Agile Analytics Applications on Hadoop

  • 1. Agile Analytics Applications Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks Formerly Viz, Data Science at Ning, LinkedIn HBase Dashboards, Career Explorer, InMaps Ā© Hortonworks Inc. 2012 1
  • 2. Agile Data - The Book (March, 2013) Read it now on OFPS A philosophy, not the only way But still, its good! Really! Ā© Hortonworks Inc. 2012 2
  • 3. We go fast... but donā€™t worry! ā€¢ Examples for EVERYTHING on the Hortonworks blog: http://hortonworks.com/blog/authors/russell_jurney ā€¢ Download the slides - click the links - read examples! ā€¢ If its not on the blog, its in the book! ā€¢ Order now: http://shop.oreilly.com/product/0636920025054.do ā€¢ Read the book NOW on OFPS: ā€¢ http://ofps.oreilly.com/titles/9781449326265/chapter_2.html Ā© Hortonworks Inc. 2012 3
  • 4. Agile Application Development: Check ā€¢ LAMP stack mature ā€¢ Post-Rails frameworks to choose from ā€¢ Enable rapid feedback and agility + NoSQL Ā© Hortonworks Inc. 2012 4
  • 5. Data Warehousing Ā© Hortonworks Inc. 2012 5
  • 6. Scientific Computing / HPC ā€¢ ā€˜Smart kidā€™ only: MPI, Globus, etc. until Hadoop Tubes and Mercury (old school) Cores and Spindles (new school) UNIVAC and Deep Blue both fill a warehouse. Weā€™re back... Ā© Hortonworks Inc. 2012 6
  • 7. Data Science? Application Data Warehousing Development 33% 33% 33% Scientiļ¬c Computing / HPC Ā© Hortonworks Inc. 2012 7
  • 8. Data Center as Computer ā€¢ Warehouse Scale Computers and applications ā€œA key challenge for architects of WSCs is to smooth out these discrepancies in a cost efļ¬cient manner.ā€ Click here for a paper on operating a ā€˜data center as computer.ā€™ Ā© Hortonworks Inc. 2012 8
  • 9. Hadoop to the Rescue! Big data refinery / Modernize ETL Audio, Web, Mobile, CRM, Video, ERP, SCM, ā€¦ Images New Data Business Transactions Docs, Sources Text, & Interactions XML HDFS Web Logs, Clicks Big Data Social, Refinery SQL NoSQL NewSQL Graph, Feeds ETL EDW MPP NewSQL Sensors, Devices, RFID Business Spatial, GPS Apache Hadoop Intelligence & Analytics Events, Other Dashboards, Reports, Visualization, ā€¦ Page 7 I stole this slide from Eric. Update: He stole it from someone else. Ā© Hortonworks Inc. 2012 9
  • 10. Hadoop to the Rescue! ā€¢ Easy to use! (Pig, Hive, Cascading) ā€¢ CHEAP: 1% the cost of SAN/NAS ā€¢ A department can afford its own Hadoop cluster! ā€¢ Dump all your data in one place: Hadoop DFS ā€¢ Silos come CRASHING DOWN! ā€¢ JOIN like crazy! ā€¢ ETL like whoah! ā€¢ An army of mappers and reducers at your command ā€¢ OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME! Ā© Hortonworks Inc. 2012 10
  • 11. NOW WHAT? Ā© Hortonworks Inc. 2012 ? 11
  • 12. Analytics Apps: It takes a Team ā€¢ Broad skill-set to make useful apps ā€¢ Basically nobody has them all ā€¢ Application development is inherently collaborative Ā© Hortonworks Inc. 2012 12
  • 13. Data Science Team ā€¢ 3-4 team members with broad, diverse skill-sets that overlap ā€¢ Transactional overhead dominates at 5+ people ā€¢ Expert researchers: lend 25-50% of their time to teams ā€¢ Pick relevant researchers. Leave them alone. Theyā€™ll spawn new products by accident. Not just CS/Math. Design. Art? ā€¢ Creative workers. Run like a studio, not an assembly line ā€¢ Total freedom... with goals and deliverables. ā€¢ Work environment matters most: private, social & quiet space ā€¢ Desks/cubes optional Ā© Hortonworks Inc. 2012 13
  • 14. How to get insight into product? ā€¢ Back-end has gotten t-h-i-c-k-e-r ā€¢ Generating $$$ insight can take 10-100x app dev ā€¢ Timeline disjoint: analytics vs agile app-dev/design ā€¢ How do you ship insights efficiently? ā€¢ How do you collaborate on research vs developer timeline? Ā© Hortonworks Inc. 2012 14
  • 15. The Wrong Way - Part One ā€œWe made a great design. Your job is to predict the future for it.ā€ Ā© Hortonworks Inc. 2012 15
  • 16. The Wrong Way - Part Two ā€œWhats taking you so long to reliably predict the future?ā€ Ā© Hortonworks Inc. 2012 16
  • 17. The Wrong Way - Part Three ā€œThe users donā€™t understand what 86% true means.ā€ Ā© Hortonworks Inc. 2012 17
  • 18. The Wrong Way - Part Four GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!! Ā© Hortonworks Inc. 2012 18
  • 19. The Wrong Way - Inevitable Conclusion Plane Mountain Ā© Hortonworks Inc. 2012 19
  • 20. Reminds me of... the waterfall model Ā© Hortonworks Inc. 2012 :( 20
  • 21. Chief Problem You canā€™t design insight in analytics applications. You discover it. You discover by exploring. Ā© Hortonworks Inc. 2012 21
  • 22. -> Strategy So make an app for exploring your data. Iterate and publish intermediate results. Which becomes a palette for what you ship. Ā© Hortonworks Inc. 2012 22
  • 23. Data Design ā€¢ Not the 1st query that = insight, its the 15th, or the 150th ā€¢ Capturing ā€œAh ha!ā€ moments ā€¢ Slow to do those in batch... ā€¢ Faster, better context in an interactive web application. ā€¢ Pre-designed charts wind up terrible. So bad. ā€¢ Easy to invest man-years in the wrong statistical models ā€¢ Semantics of presenting predictions are complex, delicate ā€¢ Opportunity lies at intersection of data & design Ā© Hortonworks Inc. 2012 23
•   24. How do we get back to Agile? © Hortonworks Inc. 2012 24
•   25. Statement of Principles (then tricks, with code) © Hortonworks Inc. 2012 25
•   26. Set up an environment where... • Insights repeatedly produced • Iterative work shared with entire team • Interactive from day 0 • Data model is consistent end-to-end • Minimal impedance between layers • Scope and depth of insights grow • Insights form the palette for what you ship • Until the application pays for itself and more © Hortonworks Inc. 2012 26
•   27. Value document > relation Most data is dirty. Most data is semi-structured or un-structured. Rejoice! © Hortonworks Inc. 2012 27
•   28. Value document > relation Note: Hive/ArrayQL/NewSQL's support of documents/array types blurs this distinction. © Hortonworks Inc. 2012 28
•   29. Relational Data = Legacy? • Why JOIN? Storage is fundamentally cheap! • Duplicate that JOIN data in one big record type! • ETL once to document format on import, NOT every job • Not zero JOINs, but far fewer JOINs • Semi-structured documents preserve data's actual structure • Column-compressed document formats beat JOINs! (paper coming) © Hortonworks Inc. 2012 29
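A sketch of the idea above (hypothetical data and field names, not the deck's Enron job): do the messages-to-recipients JOIN once at ETL time and emit one nested document per message, so downstream jobs never JOIN again.

```python
# Hypothetical denormalization: fold "recipient" rows under their message
# once at import, producing one semi-structured document per message.
messages = [
    {"message_id": "m1", "subject": "Re: frop futures"},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "connie@enron.com"},
    {"message_id": "m1", "reciptype": "cc", "address": "bob@enron.com"},
]

def to_documents(messages, recipients):
    """Group recipient rows under their message, preserving the data's structure."""
    by_id = {m["message_id"]: dict(m, tos=[], ccs=[], bccs=[]) for m in messages}
    for r in recipients:
        doc = by_id[r["message_id"]]
        doc[r["reciptype"] + "s"].append(r["address"])  # to -> tos, cc -> ccs, ...
    return list(by_id.values())

docs = to_documents(messages, recipients)
```

Every later job reads `docs` directly; the relational structure is paid for once, on import.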
•   30. Value imperative > declarative • We don't know what we want to SELECT. • Data is dirty - check each step, clean iteratively. • 85% of a data scientist's time is spent munging. See: ETL. • Imperative is optimized for our process. • Process = iterative, snowballing insight • Efficiency matters, so self-optimize © Hortonworks Inc. 2012 30
•   31. Value dataflow > SELECT © Hortonworks Inc. 2012 31
•   32. Ex. dataflow: ETL + email sent count © Hortonworks Inc. 2012 (I can't read this either. Get a big version here.) 32
•   33. Value Pig > Hive (for app-dev) • Pigs eat ANYTHING • Pig is optimized for refining data, as opposed to consuming it • Pig is imperative, iterative • Pig is dataflows, and SQLish (but not SQL) • Code modularization/re-use: Pig Macros • ILLUSTRATE speeds dev time (even with UDFs) • Easy UDFs in Java, JRuby, Jython, Javascript • Pig Streaming = use any tool, period. • Easily prepare our data as it will appear in our app. • If you prefer Hive, use Hive. But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post. © Hortonworks Inc. 2012 33
•   34. Localhost vs Petabyte scale: same tools • Simplicity is essential to scalability: use the highest-level tools we can • Prepare a good sample - tricky with joins, easy with documents • Local mode: pig -l /tmp -x local -v -w • Frequent use of ILLUSTRATE • 1st: Iterate, debug & publish locally • 2nd: Run on cluster, publish to team/customer • Consider skipping Object-Relational Mapping (ORM) • We do not trust 'databases,' only HDFS @ n=3. • Everything we serve in our app is re-creatable via Hadoop. © Hortonworks Inc. 2012 34
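One way to "prepare a good sample" for local-mode development, since documents (unlike joined relations) sample cleanly record-by-record, is reservoir sampling over the stream. A minimal sketch, not from the deck:

```python
import random

def reservoir_sample(records, n, seed=42):
    """Keep a uniform random sample of n records from a stream of unknown size."""
    rng = random.Random(seed)  # fixed seed -> reproducible local test data
    sample = []
    for i, record in enumerate(records):
        if i < n:
            sample.append(record)
        else:
            # Replace an existing slot with probability n/(i+1)
            j = rng.randint(0, i)
            if j < n:
                sample[j] = record
    return sample

sample = reservoir_sample(range(100000), 1000)
```

Run this once over the full dataset, keep the output file around, and iterate against it with `pig -x local`.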
•   35. Data-Value Pyramid Climb it. Do not skip steps. See here. © Hortonworks Inc. 2012 35
•   36. 0/1) Display atomic records on the web © Hortonworks Inc. 2012 36
•   37. 0.0) Document-serialize events • Protobuf • Thrift • JSON • Avro - I use Avro because the schema is onboard. © Hortonworks Inc. 2012 37
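A minimal illustration of document-serializing one event (JSON is used here for brevity; the point of Avro is that the schema additionally travels onboard with the data file):

```python
import json

# One email event as a nested document. The hypothetical record mirrors the
# Enron shape used later in the deck.
event = {
    "message_id": "<1731.10095812390082.JavaMail.evans@thyme>",
    "from": {"address": "bob.dobbs@enron.com", "name": None},
    "tos": [{"address": "connie@enron.com", "name": None}],
    "subject": "Re: Enron trade for frop futures",
}

line = json.dumps(event)      # serialize: one record per line on disk
restored = json.loads(line)   # deserialize: structure round-trips intact
```

With JSON the reader must already know the structure; with Avro, `restored` would come back typed against the schema stored in the file header.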
•   38. 0.1) Documents via Relation ETL enron_messages = load '/enron/enron_messages.tsv' as ( message_id:chararray, sql_date:chararray, from_address:chararray, from_name:chararray, subject:chararray, body:chararray ); enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray); split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc'; headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10; with_headers = join headers by group, enron_messages by message_id parallel 10; emails = foreach with_headers generate enron_messages::message_id as message_id, CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date, TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray), enron_messages::subject as subject, enron_messages::body as body, headers::tos.(address, name) as tos, headers::ccs.(address, name) as ccs, headers::bccs.(address, name) as bccs; store emails into '/enron/emails.avro' using AvroStorage(); Example here. © Hortonworks Inc. 2012 38
•   39. 0.2) Serialize events from streams class GmailSlurper(object): ... def init_imap(self, username, password): self.username = username self.password = password try: self.imap.shutdown() except: pass self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993) self.imap.login(username, password) self.imap.is_readonly = True ... def write(self, record): self.avro_writer.append(record) ... def slurp(self): if(self.imap and self.imap_folder): for email_id in self.id_list: (status, email_hash, charset) = self.fetch_email(email_id) if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash): print email_id, charset, email_hash['thread_id'] self.write(email_hash) © Hortonworks Inc. 2012 Scrape your own gmail in Python and Ruby. 39
•   40. 0.3) ETL Logs log_data = LOAD 'access_log' USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes); © Hortonworks Inc. 2012 40
•   41. 1) Plumb atomic events -> browser (Example stack that enables high productivity) © Hortonworks Inc. 2012 41
•   42. Lots of Stack Options with Examples • Pig with Voldemort, Ruby, Sinatra: example • Pig with ElasticSearch: example • Pig with MongoDB, Node.js: example • Pig with Cassandra, Python Streaming, Flask: example • Pig with HBase, JRuby, Sinatra: example • Pig with Hive via HCatalog: example (trivial on HDP) • Up next: Accumulo, Redis, MySQL, etc. © Hortonworks Inc. 2012 42
•   43. 1.1) cat our Avro-serialized events me$ cat_avro ~/Data/enron.avro { u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': [], u'date': u'2000-08-28T01:50:00.000Z', u'from': {u'address': u'bob.dobbs@enron.com', u'name': None}, u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>', u'subject': u'Re: Enron trade for frop futures', u'tos': [ {u'address': u'connie@enron.com', u'name': None} ] } © Hortonworks Inc. 2012 Get cat_avro in python, ruby 43
•   44. 1.2) Load our events in Pig me$ pig -l /tmp -x local -v -w grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage(); grunt> describe enron_emails emails: { message_id: chararray, datetime: chararray, from:tuple(address:chararray,name:chararray), subject: chararray, body: chararray, tos: {to: (address: chararray,name: chararray)}, ccs: {cc: (address: chararray,name: chararray)}, bccs: {bcc: (address: chararray,name: chararray)} } © Hortonworks Inc. 2012 44
•   45. 1.3) ILLUSTRATE our events in Pig grunt> illustrate enron_emails --------------------------------------------------------------------------- | emails | | message_id:chararray | | datetime:chararray | | from:tuple(address:chararray,name:chararray) | | subject:chararray | | body:chararray | | tos:bag{to:tuple(address:chararray,name:chararray)} | | ccs:bag{cc:tuple(address:chararray,name:chararray)} | | bccs:bag{bcc:tuple(address:chararray,name:chararray)} | --------------------------------------------------------------------------- | | | <1731.10095812390082.JavaMail.evans@thyme> | | 2001-01-09T06:38:00.000Z | | (bob.dobbs@enron.com, J.R. Bob Dobbs) | | Re: Enron trade for frop futures | | scamming people, blah blah | | {(connie@enron.com,)} | | {} | | {} | Upgrade to Pig 0.10+ © Hortonworks Inc. 2012 45
•   46. 1.4) Publish our events to a 'database' From Avro to MongoDB in one command: pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig Which does this: /* MongoDB libraries and configuration */ register /me/mongo-hadoop/mongo-2.7.3.jar register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar /* Set speculative execution off to avoid chance of duplicate records in Mongo */ set mapred.map.tasks.speculative.execution false set mapred.reduce.tasks.speculative.execution false /* Shortcut */ define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* By default, let's have 5 reducers */ set default_parallel 5 avros = load '$avros' using AvroStorage(); store avros into '$mongourl' using MongoStorage(); © Hortonworks Inc. 2012 Full instructions here. 46
•   47. 1.5) Check events in our 'database' $ mongo enron MongoDB shell version: 2.0.2 connecting to: enron > show collections emails system.indexes > db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"}) { "_id" : ObjectId("502b4ae703643a6a49c8d180"), "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>", "date" : "2001-01-09T06:38:00.000Z", "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" }, "subject" : "Re: Enron trade for frop futures", "body" : "Scamming more people...", "tos" : [ { "address" : "connie@enron", "name" : null } ], "ccs" : [ ], "bccs" : [ ] } © Hortonworks Inc. 2012 47
•   48. 1.6) Publish events on the web require 'rubygems' require 'sinatra' require 'mongo' require 'json' connection = Mongo::Connection.new database = connection['agile_data'] collection = database['emails'] get '/email/:message_id' do |message_id| data = collection.find_one({:message_id => message_id}) JSON.generate(data) end © Hortonworks Inc. 2012 48
•   49. 1.6) Publish events on the web © Hortonworks Inc. 2012 49
•   50. What's the point? • A designer can work against real data. • An application developer can work against real data. • A product manager can think in terms of real data. • Entire team is grounded in reality! • You'll see how ugly your data really is. • You'll see how much work you have yet to do. • Ship early and often! • Feels agile, don't it? Keep it up! © Hortonworks Inc. 2012 50
•   51. 1.7) Wrap events with Bootstrap <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet"> </head> <body> <div class="container" style="margin-top: 100px;"> <table class="table table-striped table-bordered table-condensed"> <thead> {% for key in data['keys'] %} <th>{{ key }}</th> {% endfor %} </thead> <tbody> <tr> {% for value in data['values'] %} <td>{{ value }}</td> {% endfor %} </tr> </tbody> </table> </div> </body> Complete example here with code here. © Hortonworks Inc. 2012 51
•   52. 1.7) Wrap events with Bootstrap © Hortonworks Inc. 2012 52
•   53. Refine. Add links between documents. Not the Mona Lisa, but coming along... See: here © Hortonworks Inc. 2012 53
•   54. 1.8) List links to sorted events Use Pig, serve/cache a bag/array of email documents: pig -l /tmp -x local -v -w emails_per_user = foreach (group emails by from.address) { sorted = order emails by date; last_1000 = limit sorted 1000; generate group as from_address, last_1000 as emails; }; store emails_per_user into '$mongourl' using MongoStorage(); Use your 'database', if it can sort. mongo enron > db.emails.ensureIndex({message_id: 1}) > db.emails.find().sort({date: -1}).limit(10).pretty() { { "_id" : ObjectId("4f7a5da2414e4dd0645d1176"), "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>", "from" : [ ... © Hortonworks Inc. 2012 54
•   55. 1.8) List links to sorted documents © Hortonworks Inc. 2012 55
•   56. 1.9) Make it searchable... If you have a list, search is easy with ElasticSearch and Wonderdog... /* Load ElasticSearch integration */ register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar'; register '/me/elasticsearch-0.18.6/lib/*'; define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage(); emails = load '/me/tmp/emails' using AvroStorage(); store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins'); Test it with curl: curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1' ElasticSearch has no security features. Take note. Isolate. © Hortonworks Inc. 2012 56
•   57. From now on we speed up... Don't worry, it's in the book and on the blog. http://hortonworks.com/blog/ © Hortonworks Inc. 2012 57
•   58. 2) Create Simple Charts © Hortonworks Inc. 2012 58
•   59. 2) Create Simple Tables and Charts © Hortonworks Inc. 2012 59
•   60. 2) Create Simple Charts • Start with an HTML table on general principle. • Then use nvd3.js - reusable charts for d3.js • Aggregating by properties & displaying them is the first step in entity resolution • Start extracting entities. Ex: people, places, topics, time series • Group documents by entities, rank and count. • Publish top N, time series, etc. • Fill a page with charts. • Add a chart to your event page. © Hortonworks Inc. 2012 60
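The "group documents by entities, rank and count" step above is simple in plain Python too (hypothetical records, sender address as the extracted entity):

```python
from collections import Counter

# Extract one entity (the sender) from each event document, then count and
# rank -- the same shape the Pig Top-N job on the next slide produces.
emails = [
    {"from": "bob@enron.com"},
    {"from": "connie@enron.com"},
    {"from": "bob@enron.com"},
    {"from": "alice@enron.com"},
]

top_senders = Counter(e["from"] for e in emails).most_common(2)
```

`top_senders` is exactly the JSON-friendly (entity, count) list you would publish to the browser.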
•   61. 2.1) Top N (of anything) in Pig pig -l /tmp -x local -v -w top_things = foreach (group things by key) { sorted = order things by arbitrary_rank desc; top_10_things = limit sorted 10; generate group as key, top_10_things as top_10_things; }; store top_things into '$mongourl' using MongoStorage(); Remember, this is the same structure the browser gets as json. This would make a good Pig Macro. © Hortonworks Inc. 2012 61
•   62. 2.2) Time Series (of anything) in Pig pig -l /tmp -x local -v -w /* Group by our key and date rounded to the month, get a total */ things_by_month = foreach (group things by (key, ISOToMonth(datetime))) generate flatten(group) as (key, month), COUNT_STAR(things) as total; /* Sort our totals per key by month to get a time series */ things_timeseries = foreach (group things_by_month by key) { timeseries = order things_by_month by month; generate group as key, timeseries as timeseries; }; store things_timeseries into '$mongourl' using MongoStorage(); Yet another good Pig Macro. © Hortonworks Inc. 2012 62
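The same time-series shape in plain Python, over hypothetical events, rounding ISO dates to the month the way ISOToMonth does in the Pig above:

```python
from collections import defaultdict

events = [
    {"from": "bob@enron.com", "date": "2001-01-09T06:38:00"},
    {"from": "bob@enron.com", "date": "2001-01-20T10:00:00"},
    {"from": "bob@enron.com", "date": "2001-02-02T12:00:00"},
]

# Count per (key, month); the first 7 chars of an ISO date are "YYYY-MM".
totals = defaultdict(int)
for e in events:
    totals[(e["from"], e["date"][:7])] += 1

# Sort months per key to get one time series per entity.
series = defaultdict(list)
for (key, month), total in sorted(totals.items()):
    series[key].append((month, total))
```

`series[entity]` is the month-ordered list a chart library like nvd3.js can consume directly.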
•   63. Data processing in our stack A new feature in our application might begin at any layer... great! omghi2u! I'm creative! I'm creative too! where r my legs? I know Pig! I <3 Javascript! send halp Any team member can add new features, no problemo! © Hortonworks Inc. 2012 63
•   64. Data processing in our stack ... but we shift the data-processing towards batch, as we are able. See real example here. Ex: Overall total emails calculated in each layer © Hortonworks Inc. 2012 64
•   65. 3) Exploring with Reports © Hortonworks Inc. 2012 65
•   66. 3) Exploring with Reports © Hortonworks Inc. 2012 66
•   67. 3.0) From charts to reports... • Extract entities from the properties we aggregated by in charts (Step 2) • Each entity type gets its own kind of web page • Each unique entity gets its own web page • Link to entities as they appear in atomic event documents (Step 1) • Link the most related entities together, within and between types. • More visualizations! • Parameterize results via forms. © Hortonworks Inc. 2012 67
•   68. 3.1) Looks like this... © Hortonworks Inc. 2012 68
•   69. 3.2) Cultivate common keyspaces © Hortonworks Inc. 2012 69
•   70. 3.3) Get people clicking. Learn. • Explore this web of generated pages, charts and links! • Everyone on the team gets to know your data. • Keep trying out different charts, metrics, entities, links. • See what's interesting. • Figure out what data needs cleaning and clean it. • Start thinking about predictions & recommendations. 'People' could be just your team, if the data is sensitive. © Hortonworks Inc. 2012 70
•   71. 4) Predictions and Recommendations © Hortonworks Inc. 2012 71
•   72. 4.0) Preparation • We've already extracted entities, their properties and relationships • Our charts show where our signal is rich • We've cleaned our data to make it presentable • The entire team has an intuitive understanding of the data • They got that understanding by exploring the data • We are all on the same page! © Hortonworks Inc. 2012 72
•   73. 4.1) Smooth sparse data © Hortonworks Inc. 2012 See here. 73
•   74. 4.2) Think in different perspectives • Networks • Time Series • Distributions • Natural Language • Probability / Bayes © Hortonworks Inc. 2012 See here. 74
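One simple way to smooth sparse or noisy series data is a trailing moving average; a minimal sketch (my illustration, not necessarily the book's method):

```python
def moving_average(series, window=3):
    """Smooth a sparse/noisy time series with a trailing window average."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]  # up to `window` trailing points
        out.append(sum(chunk) / len(chunk))
    return out

# A spiky, sparse series (many zeros) becomes a gentler trend line.
smoothed = moving_average([0, 10, 0, 0, 10, 0])
```

Smoothed series chart far better than raw sparse counts, and downstream models are less whipsawed by empty buckets.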
•   75. 4.3) Sink more time in deeper analysis TF-IDF import 'tfidf.macro'; my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body'); /* Get the top 10 TF-IDF scores per message */ per_message_cassandra = foreach (group tfidf_all by message_id) { sorted = order tfidf_all by value desc; top_10_topics = limit sorted 10; generate group, top_10_topics.(score, value); } Probability / Bayes sent_replies = join sent_counts by (from, to), reply_counts by (from, to); reply_ratios = foreach sent_replies generate sent_counts::from as from, sent_counts::to as to, (float)reply_counts::total/(float)sent_counts::total as ratio; reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio; © Hortonworks Inc. 2012 Example with code here and macro here. 75
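The TF-IDF the Pig macro computes can be sketched in a few lines of Python over a toy corpus (an illustration of the idea only; the macro's exact weighting may differ):

```python
import math
from collections import Counter

# Two hypothetical tokenized "messages".
docs = {
    "m1": "enron trade frop futures".split(),
    "m2": "enron scam scam".split(),
}
n_docs = len(docs)
# Document frequency: how many docs each term appears in.
df = Counter(term for words in docs.values() for term in set(words))

def tf_idf(words):
    """Score each term: (term freq in doc) * log(N / doc freq)."""
    tf = Counter(words)
    return {t: (tf[t] / len(words)) * math.log(n_docs / df[t]) for t in tf}

scores = tf_idf(docs["m2"])
```

Terms common to every document (here "enron") score zero; terms distinctive to one message (here "scam") score high, which is why the top-10 TF-IDF terms make good per-message topics.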
•   76. 4.4) Add predictions to reports © Hortonworks Inc. 2012 76
•   77. 5) Enable new actions © Hortonworks Inc. 2012 77
•   78. Example: Packetpig and PacketLoop snort_alerts = LOAD '$pcap' USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig'); countries = FOREACH snort_alerts GENERATE com.packetloop.packetpig.udf.geoip.Country(src) as country, priority; countries = GROUP countries BY country; countries = FOREACH countries GENERATE group, AVG(countries.priority) as average_severity; STORE countries into 'output/choropleth_countries' using PigStorage(','); Code here. © Hortonworks Inc. 2012 78
•   79. Example: Packetpig and PacketLoop © Hortonworks Inc. 2012 79
•   80. • Amsterdam, March 20-21st • Call for papers now open! • Submit a lightning talk! • http://hadoopsummit.org/amsterdam/ • Discount coupons - 10% off! © Hortonworks Inc. 2012 80
•   81. Hortonworks Data Platform • Simplify deployment to get started quickly and easily • Monitor, manage any size cluster with familiar console and tools • Only platform to include data integration services to interact with any data • Metadata services open the platform for integration with existing applications • Dependable high availability architecture • Tested at scale to future-proof your cluster growth ✓ Reduce risks and cost of adoption ✓ Lower the total cost to administer and provision ✓ Integrate with your existing ecosystem © Hortonworks Inc. 2012 81
•   82. Hortonworks Training The expert source for Apache Hadoop training & certification Role-based Developer and Administration training – Coursework built and maintained by the core Apache Hadoop development team. – The "right" course, with the most extensive and realistic hands-on materials – Provide an immersive experience into real-world Hadoop scenarios – Public and private courses available Comprehensive Apache Hadoop Certification – Become a trusted and valuable Apache Hadoop expert © Hortonworks Inc. 2012 82
•   83. Next Steps? 1 Download Hortonworks Data Platform hortonworks.com/download 2 Use the getting started guide hortonworks.com/get-started 3 Learn more… get support Hortonworks Training: • Expert role-based training • Courses for admins, developers and operators • Certification program • Custom onsite options hortonworks.com/training Hortonworks Support: • Full lifecycle technical support across four service levels • Delivered by Apache Hadoop Experts/Committers • Forward-compatible hortonworks.com/support © Hortonworks Inc. 2012 83
•   84. Thank You! Questions & Answers Slides: http://slidesha.re/O8kjaF Follow: @hortonworks and @rjurney Read: hortonworks.com/blog © Hortonworks Inc. 2012 84

Editor's Notes

83. Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale on commodity hardware and/or in a cloud environment. As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise. Run through the points on left…