4.6.2) Step 2: COUNT(From, To, Cc)/Total. P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone ...
4.6.3) Wait - Stop Here! It works! They match...
4.4) Add predictions to reports
5) Enable new actions
Why doesn’t Kate reply to my emails? • What time is best to catch her? • Are they too long? • Are they meant to be replied to...
Example: LinkedIn InMaps. Shared at http://inmaps.linkedinlabs.com/share/Russell_Jurney/3162887480966957659864125703414800...
Example: Packetpig and PacketLoop: snort_alerts = LOAD $pcap USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoad...
Example: Packetpig and PacketLoop
Thank You! Questions & Answers. Slides: http://slidesha.re/T943VU Follow: @hortonworks and @rjurney Read: hortonworks.com/blog ...
LA HUG - Agile Analytics Applications on HDP

Transcript of "LA HUG - Agile Analytics Applications on HDP"

  1. Agile Analytics Applications on HDP. Russell Jurney (@rjurney), Hadoop Evangelist @Hortonworks. Formerly Viz, Data Science at Ning, LinkedIn. HBase Dashboards, Career Explorer, InMaps. © Hortonworks Inc. 2012
  2. About me... Bearding.
     • I’m going to beat this guy
     • Seriously
     • Bearding is my #1 natural talent
     • Salty Sea Beard
     • Fortified with Pacific Ocean Minerals
  3. Agile Data - The Book (July, 2013). Read on Safari Rough Cuts. The early release and the code are linked from the slide.
  4. We go fast... but don’t worry!
     • Examples for EVERYTHING on the Hortonworks blog: http://hortonworks.com/blog/authors/russell_jurney
     • Download the slides - click the links - read examples!
     • If it’s not on the blog, it’s in the book!
     • Order now: http://shop.oreilly.com/product/0636920025054.do
     • Read the book Friday on Safari Rough Cuts
  5. HDP Sandbox - Talk Lessons Coming!
  6. Agile Application Development: Check
     • LAMP stack mature
     • Post-Rails frameworks to choose from
     • Enable rapid feedback and agility + NoSQL
  7. Data Warehousing
  8. Scientific Computing / HPC
     • ‘Smart kid’ only: MPI, Globus, etc. until Hadoop
     • Tubes and Mercury (old school)
     • Cores and Spindles (new school)
     • UNIVAC and Deep Blue both fill a warehouse. We’re back...
  9. Data Science? (Diagram labels: Application Development, Data Warehousing, Scientific Computing / HPC)
  10. Data Center as Computer
      • Warehouse Scale Computers and applications
      • “A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.”
      • The slide links to a paper on operating a ‘data center as computer.’
  17. Tez – Faster MapReduce!
  18. Hadoop to the Rescue!
  19. Hadoop to the Rescue!
      • Easy to use! (Pig, Hive, Cascading)
      • CHEAP: 1% the cost of SAN/NAS
      • A department can afford its own Hadoop cluster!
      • Dump all your data in one place: Hadoop DFS
      • Silos come CRASHING DOWN!
      • JOIN like crazy!
      • ETL like whoah!
      • An army of mappers and reducers at your command
      • OMGWTFBBQ IT’S SO GREAT! I FEEL AWESOME!
  20. NOW WHAT?
  21. Analytics Apps: It takes a Team
      • Broad skill-set
      • Nobody has them all
      • Inherently collaborative
  22. Data Science Team
      • 3-4 team members with broad, diverse skill-sets that overlap
      • Transactional overhead dominates at 5+ people
      • Expert researchers: lend 25-50% of their time to teams
      • Creative workers. Run like a studio, not an assembly line
      • Total freedom... with goals and deliverables.
      • Work environment matters most
  23. How to get insight into product?
      • Back-end has gotten t-h-i-c-k-e-r
      • Generating $$$ insight can take 10-100x app dev
      • Timeline disjoint: analytics vs agile app-dev/design
      • How do you ship insights efficiently?
      • How do you collaborate on research vs developer timeline?
  24. The Wrong Way - Part One: “We made a great design. Your job is to predict the future for it.”
  25. The Wrong Way - Part Two: “What’s taking you so long to reliably predict the future?”
  26. The Wrong Way - Part Three: “The users don’t understand what 86% true means.”
  27. The Wrong Way - Part Four: GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!
  28. The Wrong Way - Inevitable Conclusion (Image labels: Plane, Mountain)
  29. Reminds me of... the waterfall model :(
  30. Chief Problem: You can’t design insight in analytics applications. You discover it. You discover by exploring.
  31. -> Strategy: So make an app for exploring your data. Iterate and publish intermediate results. Which becomes a palette for what you ship.
  32. Data Design
      • Not the 1st query that = insight, it’s the 15th, or the 150th
      • Capturing “Ah ha!” moments
      • Slow to do those in batch...
      • Faster, better context in an interactive web application.
      • Pre-designed charts wind up terrible. So bad.
      • Easy to invest man-years in the wrong statistical models
      • Semantics of presenting predictions are complex, delicate
      • Opportunity lies at intersection of data & design
  33. How do we get back to Agile?
  34. Statement of Principles (then tricks, with code)
  35. Set up an environment where...
      • Insights repeatedly produced
      • Iterative work shared with entire team
      • Interactive from day 0
      • Data model is consistent end-to-end
      • Minimal impedance between layers
      • Scope and depth of insights grow
      • Insights form the palette for what you ship
      • Until the application pays for itself and more
  36. Value document > relation. Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
  37. Value document > relation. Note: Hive/ArrayQL/NewSQL’s support of documents/array types blurs this distinction.
  38. Relational Data = Legacy Format
      • Why JOIN? Storage is fundamentally cheap!
      • Duplicate that JOIN data in one big record type!
      • ETL once to document format on import, NOT every job
      • Not zero JOINs, but far fewer JOINs
      • Semi-structured documents preserve data’s actual structure
      • Column compressed document formats beat JOINs! (paper coming)
  39. Value imperative > declarative
      • We don’t know what we want to SELECT.
      • Data is dirty - check each step, clean iteratively.
      • 85% of data scientist’s time spent munging. See: ETL.
      • Imperative is optimized for our process.
      • Process = iterative, snowballing insight
      • Efficiency matters, self optimize
  40. Value dataflow > SELECT
  41. Ex. dataflow: ETL + email sent count (I can’t read this either. A big version is linked from the slide.)
  42. Value Pig > Hive (for app-dev)
      • Pigs eat ANYTHING
      • Pig is optimized for refining data, as opposed to consuming it
      • Pig is imperative, iterative
      • Pig is dataflows, and SQLish (but not SQL)
      • Code modularization/re-use: Pig Macros
      • ILLUSTRATE speeds dev time (even UDFs)
      • Easy UDFs in Java, JRuby, Jython, Javascript
      • Pig Streaming = use any tool, period.
      • Easily prepare our data as it will appear in our app.
      • If you prefer Hive, use Hive.
      But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post.
  43. Localhost vs Petabyte scale: same tools
      • Simplicity essential to scalability: highest level tools we can use
      • Prepare a good sample - tricky with joins, easy with documents
      • Local mode: pig -l /tmp -x local -v -w
      • Frequent use of ILLUSTRATE
      • 1st: Iterate, debug & publish locally
      • 2nd: Run on cluster, publish to team/customer
      • Consider skipping Object-Relational-Mapping (ORM)
      • We do not trust ‘databases,’ only HDFS @ n=3.
      • Everything we serve in our app is re-creatable via Hadoop.
  44. Data-Value Pyramid. Climb it. Do not skip steps. See here.
  45. 0/1) Display atomic records on the web
  46. 0.0) Document-serialize events
      • Protobuf
      • Thrift
      • JSON
      • Avro - I use Avro because the schema is onboard.
  47. 0.1) Documents via Relation ETL
      enron_messages = load '/enron/enron_messages.tsv' as (
        message_id:chararray,
        sql_date:chararray,
        from_address:chararray,
        from_name:chararray,
        subject:chararray,
        body:chararray);
      enron_recipients = load '/enron/enron_recipients.tsv' as (
        message_id:chararray,
        reciptype:chararray,
        address:chararray,
        name:chararray);
      split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
      headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
      with_headers = join headers by group, enron_messages by message_id parallel 10;
      emails = foreach with_headers generate
        enron_messages::message_id as message_id,
        CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
        TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
        enron_messages::subject as subject,
        enron_messages::body as body,
        headers::tos.(address, name) as tos,
        headers::ccs.(address, name) as ccs,
        headers::bccs.(address, name) as bccs;
      store emails into '/enron/emails.avro' using AvroStorage(...
      Example here.
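      The cogroup-then-join step above is the heart of "document > relation": recipient rows are folded into their parent message so each email becomes one nested record. A minimal Python sketch of the same idea follows, assuming rows arrive as plain dicts with the column names from the slide. The function name `build_email_documents` is hypothetical, the date conversion is left out, and unlike the Pig inner join this version keeps messages that have no recipients.

```python
from collections import defaultdict


def build_email_documents(messages, recipients):
    """Fold recipient rows into their parent message, one document per email."""
    # Group recipient rows by message_id and header type (the cogroup step).
    headers = defaultdict(lambda: {"to": [], "cc": [], "bcc": []})
    for row in recipients:
        headers[row["message_id"]][row["reciptype"]].append(
            {"address": row["address"], "name": row["name"]}
        )

    # Attach the grouped headers to each message (the join step).
    documents = []
    for msg in messages:
        grouped = headers[msg["message_id"]]
        documents.append({
            "message_id": msg["message_id"],
            "from": {"address": msg["from_address"], "name": msg["from_name"]},
            "subject": msg["subject"],
            "body": msg["body"],
            "tos": grouped["to"],
            "ccs": grouped["cc"],
            "bccs": grouped["bcc"],
        })
    return documents
```

      Once the recipients live inside the email document, later steps (publishing, charting, network extraction) can read one record instead of repeating the join.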
  48. 0.2) Serialize events from streams
      class GmailSlurper(object):
        ...
        def init_imap(self, username, password):
          self.username = username
          self.password = password
          try:
            imap.shutdown()
          except:
            pass
          self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
          self.imap.login(username, password)
          self.imap.is_readonly = True
        ...
        def write(self, record):
          self.avro_writer.append(record)
        ...
        def slurp(self):
          if(self.imap and self.imap_folder):
            for email_id in self.id_list:
              (status, email_hash, charset) = self.fetch_email(email_id)
              if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                print email_id, charset, email_hash['thread_id']
                self.write(email_hash)
      Scrape your own gmail in Python and Ruby.
  49. 0.3) ETL Logs
      log_data = LOAD 'access_log'
        USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
        AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
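      To make the log-loading step concrete, here is a rough Python sketch of what a Common Log Format loader does to each line: split it into named fields. The field names mirror the slide's AS clause; `status` is added because the log format carries it, and the regular expression and the function name `parse_log_line` are illustrative assumptions, not the piggybank implementation.

```python
import re

# Common Log Format: host ident user [time] "method uri proto" status bytes
CLF_PATTERN = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line is not in Common Log Format."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None
```

      Returning None for malformed lines keeps the "data is dirty, clean iteratively" habit: bad rows can be counted and inspected rather than crashing the job.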
  50. 1) Plumb atomic events -> browser (Example stack that enables high productivity)
  51. Lots of Stack Options with Examples
      • Pig with Voldemort, Ruby, Sinatra: example
      • Pig with ElasticSearch: example
      • Pig with MongoDB, Node.js: example
      • Pig with Cassandra, Python Streaming, Flask: example
      • Pig with HBase, JRuby, Sinatra: example
      • Pig with Hive via HCatalog: example (trivial on HDP)
      • Up next: Accumulo, Redis, MySQL, etc.
  52. 1.1) cat our Avro serialized events
      me$ cat_avro ~/Data/enron.avro
      {u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': ...28T01:50:00.000Z', u'from': {u'address': u'bob.dobbs@enron.com... u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>', ...trade for frop futures', u'tos': [{u'address': u'connie@enron.com', u'name': None}]}
      Get cat_avro in python, ruby
  53. 1.2) Load our events in Pig
      me$ pig -l /tmp -x local -v -w
      grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
      grunt> describe enron_emails
      emails: {
        message_id: chararray,
        datetime: chararray,
        from: tuple(address: chararray, name: chararray),
        subject: chararray,
        body: chararray,
        tos: {to: (address: chararray, name: chararray)},
        ccs: {cc: (address: chararray, name: chararray)},
        bccs: {bcc: (address: chararray, name: chararray)}
      }
  54. 1.3) ILLUSTRATE our events in Pig
      grunt> illustrate enron_emails
      (Output of ILLUSTRATE for the emails relation, shown here one field per row.)
      | message_id:chararray | <1731.10095812390082.JavaMail.evans@thyme> |
      | datetime:chararray | 2001-01-09T06:38:00.000Z |
      | from:tuple(address:chararray,name:chararray) | (bob.dobbs@enron.com, J.R. Bob Dobbs) |
      | subject:chararray | Re: Enron trade for frop futures |
      | body:chararray | scamming people, blah blah |
      | tos:bag{to:tuple(address:chararray,name:chararray)} | {(connie@enron.com,)} |
      | ccs:bag{cc:tuple(address:chararray,name:chararray)} | {} |
      | bccs:bag{bcc:tuple(address:chararray,name:chararray)} | {} |
      Upgrade to Pig 0.10+
  55. 1.4) Publish our events to a ‘database’
      From Avro to MongoDB in one command:
      pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl=mongodb://localhost/enron.emails avro_to_mongo.pig
      Which does this (several lines are clipped on the slide):
      /* MongoDB libraries and configuration */
      register /me/mongo-hadoop/mongo-2.7.3.jar
      re...hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
      register /me/mongo-hadoop/pig...SNAPSHOT.jar
      /* Set speculative execution off to avoid chance of duplicate records in...
      ...mapred.map.tasks.speculative.execution false
      set mapred.reduce.tasks.speculative.exec...
      ...com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
      /* By default, lets have 5 redu...
      ... = load $avros using AvroStorage();
      store avros into $mongourl using MongoStorage(...
      Full instructions here.
  56. 1.5) Check events in our ‘database’
      $ mongo enron
      MongoDB shell version: 2.0.2
      connecting to: enron
      > show collections
      email...
      ...db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
      {
        ...ObjectId("502b4ae703643a6a49c8d180"),
        "message_id" : "<1731.10095812390082.JavaM...
        ..."2001-01-09T06:38:00.000Z",
        "from" : { "address" : "bob.dobbs@enron.com", "nam...
        "subject" : Re: Enron trade for frop futures,
        "body" : "Scamming more people..."
        ..."connie@enron", "name" : null } ],
        "ccs" : [ ],
        "bccs" : [ ]
      }
  57. 1.6) Publish events on the web
      require 'rubygems'
      require 'sinatra'
      require 'mongo'
      require 'json'
      connection = Mongo::Connection.new
      database = connection['agile_data']
      collection = database['emails']
      get '/email/:message_id' do |message_id|
        data = collection.find_one({:message_id => message_id})
        JSON.generate(data)
      end
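      For readers following along in Python rather than Ruby, the same "look up one event by id and return it as JSON" route can be sketched as a standard-library WSGI app. This is an illustration, not the talk's code: the `EMAILS` dict is a hypothetical in-memory stand-in for the MongoDB collection, and the route shape is copied from the Sinatra example above.

```python
import json

# Hypothetical in-memory stand-in for the MongoDB 'emails' collection.
EMAILS = {
    "<1731.10095812390082.JavaMail.evans@thyme>": {
        "message_id": "<1731.10095812390082.JavaMail.evans@thyme>",
        "subject": "Re: Enron trade for frop futures",
    }
}


def app(environ, start_response):
    """WSGI app: GET /email/<message_id> returns the stored document as JSON."""
    path = environ.get("PATH_INFO", "")
    prefix = "/email/"
    record = EMAILS.get(path[len(prefix):]) if path.startswith(prefix) else None
    if record is None:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(record).encode("utf-8")]
```

      To try it locally, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` serves the app on port 8000. Swapping the dict lookup for a database query is the only change needed to match the slide.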
  58. 1.6) Publish events on the web
  59. What’s the point?
      • A designer can work against real data.
      • An application developer can work against real data.
      • A product manager can think in terms of real data.
      • Entire team is grounded in reality!
      • You’ll see how ugly your data really is.
      • You’ll see how much work you have yet to do.
      • Ship early and often!
      • Feels agile, don’t it? Keep it up!
  60. 1.7) Wrap events with Bootstrap
      <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
      </head>
      <body>
      <div class="container" style="margin-top: 100px;">
        <table class="table table-striped table-bordered table-condensed">
          <thead>
            {% for key in data['keys'] %}
              <th>{{ key }}</th>
            {% endfor %}
          </thead>
          <tbody>
            <tr>
              {% for value in data['values'] %}
                <td>{{ value }}</td>
              {% endfor %}
            </tr>
          </tbody>
        </table>
      </div>
      </body>
      Complete example here with code here.
  61. 1.7) Wrap events with Bootstrap
  62. Refine. Add links between documents. Not the Mona Lisa, but coming along... See: here
  63. 1.8) List links to sorted events
      Use Pig, serve/cache a bag/array of email documents:
      pig -l /tmp -x local -v -w
      emails_per_user = foreach (group emails by from.address) {
        sorted = order emails by date desc;
        last_1000 = limit sorted 1000;
        generate group as from_address, last_1000 as emails;
      };
      store emails_per_user into '$mongourl' using MongoStorage();
      Use your ‘database’, if it can sort.
      mongo enron
      > db.emails.ensureIndex({message_id: 1})
      > db.emails.find().sort({date: -1}).limit(10).pretty()
      { { "_id" : ObjectId("4f7a5da2414e4dd0645d1176"), "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>", "from" : [ ...
  64. 1.8) List links to sorted documents
  65. 1.9) Make it searchable...
      If you have a list, search is easy with ElasticSearch and Wonderdog...
      /* Load ElasticSearch integration */
      register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
      register '/me/elasticsearch-0.18.6/lib/*';
      define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
      emails = load '/me/tmp/emails' using AvroStorage();
      store emails into 'es://email/email?json=false&size=1000' using
        ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
      Test it with curl:
      curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
      ElasticSearch has no security features. Take note. Isolate.
  66. From now on we speed up... Don’t worry, it’s in the book and on the blog. http://hortonworks.com/blog/
  67. 2) Create Simple Charts
  68. 2) Create Simple Tables and Charts
  69. 2) Create Simple Charts
      • Start with an HTML table on general principle.
      • Then use nvd3.js - reusable charts for d3.js
      • Aggregating by properties and displaying the results is the first step in entity resolution
      • Start extracting entities. Ex: people, places, topics, time series
      • Group documents by entities, rank and count.
      • Publish top N, time series, etc.
      • Fill a page with charts.
      • Add a chart to your event page.
  70. 2.1) Top N (of anything) in Pig
      pig -l /tmp -x local -v -w
      top_things = foreach (group things by key) {
        sorted = order things by arbitrary_rank desc;
        top_10_things = limit sorted 10;
        generate group as key, top_10_things as top_10_things;
      };
      store top_things into '$mongourl' using MongoStorage();
      Remember, this is the same structure the browser gets as JSON. This would make a good Pig Macro.
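      The same group, sort, limit pattern is easy to prototype on a local sample before running the Pig job. The sketch below assumes each record is a dict with `key` and `arbitrary_rank` fields, matching the names in the Pig script; `top_n_per_key` is a hypothetical helper name.

```python
import heapq
from collections import defaultdict


def top_n_per_key(things, n=10):
    """Group records by key and keep the n highest-ranked records per key."""
    grouped = defaultdict(list)
    for thing in things:
        grouped[thing["key"]].append(thing)
    # heapq.nlargest returns the records sorted from highest to lowest rank.
    return {
        key: heapq.nlargest(n, rows, key=lambda row: row["arbitrary_rank"])
        for key, rows in grouped.items()
    }
```

      The result is one nested record per key, which is the shape the slide says the browser receives.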
  71. 2.2) Time Series (of anything) in Pig
      pig -l /tmp -x local -v -w
      /* Group by our key and date rounded to the month, get a total */
      things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
        generate flatten(group) as (key, month), COUNT_STAR(things) as total;
      /* Sort our totals per key by month to get a time series */
      things_timeseries = foreach (group things_by_month by key) {
        timeseries = order things_by_month by month;
        generate group as key, timeseries as timeseries;
      };
      store things_timeseries into '$mongourl' using MongoStorage();
      Yet another good Pig Macro.
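      A small Python sketch of the same two steps (count per key and month, then sort each key's months) can help check the logic on a sample. It assumes records are dicts with a string `key` and an ISO 8601 `datetime` string, so the first seven characters give the month; the function name `monthly_timeseries` is hypothetical.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count records per (key, month) and return a month-sorted series per key."""
    # ISO 8601 datetimes start with YYYY-MM, so the first 7 characters are the month.
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)
    series = defaultdict(list)
    # Sorting the (key, month) pairs puts each key's months in ascending order.
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)
```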
  72. Data processing in our stack: A new feature in our application might begin at any layer... great! (Cartoon captions: “omghi2u!”, “I’m creative!”, “I’m creative too!”, “where r my legs?”, “I know Pig!”, “I <3 Javascript!”, “send halp”) Any team member can add new features, no problemo!
  73. Data processing in our stack... but we shift the data-processing towards batch, as we are able. See real example here. Ex: Overall total emails calculated in each layer
  74. 3) Exploring with Reports
  75. 3) Exploring with Reports
  76. 3.0) From charts to reports...
      • Extract entities from properties we aggregated by in charts (Step 2)
      • Each entity gets its own type of web page
      • Each unique entity gets its own web page
      • Link to entities as they appear in atomic event documents (Step 1)
      • Link most related entities together, same and between types.
      • More visualizations!
      • Parameterize results via forms.
  77. 3.1) Looks like this...
  78. 3.2) Cultivate common keyspaces
  79. 3.3) Get people clicking. Learn.
      • Explore this web of generated pages, charts and links!
      • Everyone on the team gets to know your data.
      • Keep trying out different charts, metrics, entities, links.
      • See what’s interesting.
      • Figure out what data needs cleaning and clean it.
      • Start thinking about predictions & recommendations.
      ‘People’ could be just your team, if data is sensitive.
  80. 4) Predictions and Recommendations
  81. 4.0) Preparation
      • We’ve already extracted entities, their properties and relationships
      • Our charts show where our signal is rich
      • We’ve cleaned our data to make it presentable
      • The entire team has an intuitive understanding of the data
      • They got that understanding by exploring the data
      • We are all on the same page!
  82. 4.2) Think in different perspectives
      • Networks
      • Time Series / Distributions
      • Natural Language Processing
      • Conditional Probabilities / Bayesian Inference
      • Check out Chapter 2 of the book... See here.
  83. 4.3) Networks
  84. 4.3.1) Weighted Email Networks in Pig
      DEFINE header_pairs(email, col1, col2) RETURNS pairs {
        filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
        flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
        $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
      }
      /* Get email address pairs for each type of connection, and union them together */
      emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
      from_to = header_pairs(emails, from, to);
      from_cc = header_pairs(emails, from, cc);
      from_bcc = header_pairs(emails, from, bcc);
      pairs = UNION from_to, from_cc, from_bcc;
      /* Get a count of emails over these edges. */
      pair_groups = GROUP pairs BY (ego1, ego2);
      sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
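      The macro above builds a weighted edge list: one (sender, recipient) pair per header entry, with the weight being the number of emails over that edge. A minimal Python sketch of the same computation follows, assuming the nested email documents from Step 0.1 (a `from` record plus `tos`, `ccs` and `bccs` lists); `sent_counts` here is a plain function, not the Pig relation.

```python
from collections import Counter
from itertools import chain


def sent_counts(emails):
    """Count emails per (sender, recipient) edge across to, cc and bcc headers."""
    edges = Counter()
    for email in emails:
        sender = (email.get("from") or {}).get("address")
        if not sender:
            continue  # mirror the macro's null filter
        recipients = chain(
            email.get("tos", []), email.get("ccs", []), email.get("bccs", [])
        )
        for recipient in recipients:
            if recipient.get("address"):
                # Lower-case both ends so the same address is one node.
                edges[(sender.lower(), recipient["address"].lower())] += 1
    return edges
```

      The resulting counter can be written out as a CSV edge list with a weight column, which is the kind of input a graph tool such as Gephi can import.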
  85. 4.3.2) Networks Viz with Gephi
  86. 4.3.3) Gephi = Easy
  87. 4.3.4) Social Network Analysis
  88. 4.4) Time Series & Distributions
      pig -l /tmp -x local -v -w
      /* Count things per day */
      things_per_day = foreach (group things by (key, ISOToDay(datetime))) generate flatten(group) as (key, day), COUNT_STAR(things) as total;
      /* Sort our totals per key by day to get a sorted time series */
      things_timeseries = foreach (group things_per_day by key) {
        timeseries = order things_per_day by day;
        generate group as key, timeseries as timeseries;
      };
      store things_timeseries into '$mongourl' using MongoStorage();
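The two steps on the slide above (count per key per day, then sort each key's counts by day) can be sketched in plain Python as follows. The input shape, a list of `(key, ISO datetime string)` tuples, and the sample values are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def things_timeseries(things):
    """Return {key: [(day, total), ...]} with each key's days in ascending order."""
    per_day = Counter()
    for key, iso in things:
        # Truncate the timestamp to its calendar day, as ISOToDay does in the Pig version.
        day = datetime.fromisoformat(iso).date().isoformat()
        per_day[(key, day)] += 1

    series = defaultdict(list)
    # ISO dates sort correctly as strings, so sorting the (key, day) tuples orders each series.
    for (key, day), total in sorted(per_day.items()):
        series[key].append((day, total))
    return dict(series)


# Hypothetical sample data.
things = [
    ("a", "2001-05-14T09:30:00"),
    ("a", "2001-05-13T10:00:00"),
    ("a", "2001-05-14T17:00:00"),
    ("b", "2001-05-14T08:00:00"),
]
result = things_timeseries(things)
print(result)
```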
  89. 4.4.1) Smooth Sparse Data
      See here.
  90. 4.4.2) Regress to find Trends
      • JRuby Linear Regression UDF
      • Pig to use the UDF
      • Trend Line in your Application
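The slide's JRuby UDF is not reproduced in this transcript. As a rough sketch of what a trend-line regression computes, assuming ordinary least squares over evenly spaced observations (an assumption, since the slide does not say which fit is used), the slope and intercept can be derived like this. The function name `linear_trend` is hypothetical.

```python
def linear_trend(ys):
    """Fit y = slope * x + intercept by ordinary least squares, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend line")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daily totals that rise by 2 per day.
slope, intercept = linear_trend([1, 3, 5, 7])
print(slope, intercept)
```

A positive slope means the series is trending up, and the fitted line can be drawn over the raw counts in a chart.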
  91. 4.5.1) Natural Language Processing
      import 'tfidf.macro';
      my_tf_idf_scores = tf_idf(id_body, message_id, body);
      /* Get the top 10 Tf*Idf scores per message */
      per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
        sorted = order my_tf_idf_scores by value desc;
        top_10_topics = limit sorted 10;
        generate group, top_10_topics.(score, value);
      };
      Example with code here and macro here.
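To make the top-N-per-message idea concrete, here is a small Python sketch. It assumes one common TF-IDF variant (term count divided by message length, times log of total messages over messages containing the term) and whitespace tokenization. The Pig macro referenced on the slide may use a different formula and tokenizer, and the `top_topics` name is hypothetical.

```python
import math
from collections import Counter


def top_topics(docs, n=10):
    """docs maps message_id -> body. Returns message_id -> top-n [(token, score), ...]."""
    tokenized = {mid: body.lower().split() for mid, body in docs.items()}

    # Document frequency: in how many messages does each token appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    total_docs = len(tokenized)
    result = {}
    for mid, tokens in tokenized.items():
        term_counts = Counter(tokens)
        scores = {
            token: (count / len(tokens)) * math.log(total_docs / doc_freq[token])
            for token, count in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        result[mid] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return result


# Hypothetical sample messages.
docs = {"m1": "pig pig hadoop", "m2": "hive hadoop"}
topics = top_topics(docs)
print(topics)
```

Tokens that appear in every message score zero under this variant, which is why shared vocabulary drops out of the topic list.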
  92. 4.5.2) NLP: Extract Topics!
  93. 4.5.3) NLP for All: Extract Topics!
      • TF-IDF in Pig - 2 lines of code with Pig Macros: http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
      • LDA with Pig and the Lucene Tokenizer: http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
  94. 4.6) Probability & Bayesian Inference
  95. 4.6.1) Gmail Suggested Recipients
  96. 4.6.1) Reproducing it with Pig...
  97. 4.6.2) Step 1: COUNT(From -> To)
  98. 4.6.2) Step 2: COUNT(From, To, Cc) / Total
      P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
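Steps 1 and 2 amount to a ratio of two counts. A minimal Python sketch, assuming each email is a dict with a `from` string and optional `to`/`cc` lists (the record layout, the sample data and the `cc_probabilities` name are all illustrative assumptions):

```python
from collections import Counter


def cc_probabilities(emails):
    """Return {(from, to, cc): P(cc | from, to)} estimated from co-occurrence counts."""
    pair_totals = Counter()    # Step 1: COUNT(From -> To)
    triple_counts = Counter()  # Step 2 numerator: COUNT(From, To, Cc)
    for email in emails:
        sender = email["from"].lower()
        tos = {addr.lower() for addr in email.get("to") or []}
        ccs = {addr.lower() for addr in email.get("cc") or []}
        for to in tos:
            pair_totals[(sender, to)] += 1
            for cc in ccs:
                triple_counts[(sender, to, cc)] += 1
    # Divide each triple count by the total number of emails sent over its (from, to) pair.
    return {key: count / pair_totals[key[:2]] for key, count in triple_counts.items()}


# Hypothetical sample data.
emails = [
    {"from": "alice@example.com", "to": ["bob@example.com"], "cc": ["carol@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com"], "cc": ["carol@example.com", "dave@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "alice@example.com", "to": ["erin@example.com"], "cc": ["carol@example.com"]},
]
probs = cc_probabilities(emails)
print(probs)
```

Ranking the cc candidates for a given (from, to) pair by these probabilities gives a simple suggested-recipients list.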
  99. 4.6.3) Wait - Stop Here! It works! They match...
  100. 4.4) Add predictions to reports
  101. 5) Enable new actions
  102. Why doesn’t Kate reply to my emails?
       • What time is best to catch her?
       • Are they too long?
       • Are they meant to be replied to (contain original content)?
       • Are they nice? (sentiment analysis)
       • Do I reply to her emails (reciprocity)?
       • Do I cc the wrong people (my mom)?
  103. Example: LinkedIn InMaps
       Shared at http://inmaps.linkedinlabs.com/share/Russell_Jurney/316288748096695765986412570341480077402
       Personalization drives engagement.
  104. Example: Packetpig and PacketLoop
       /* Load Snort alerts from a packet capture, then average alert priority per source country */
       snort_alerts = LOAD '$pcap' USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
       countries = FOREACH snort_alerts GENERATE com.packetloop.packetpig.udf.geoip.Country(src) as country, priority;
       countries = GROUP countries BY country;
       countries = FOREACH countries GENERATE group, AVG(countries.priority) as average_severity;
       STORE countries into 'output/choropleth_countries' using PigStorage(',');
       Code here.
  105. Example: Packetpig and PacketLoop
  106. Thank You! Questions & Answers
       Slides: http://slidesha.re/T943VU
       Follow: @hortonworks and @rjurney
       Read: hortonworks.com/blog