Agile Analytics Applications on Hadoop
Presentation to match the book Agile Big Data

Agile Analytics Applications on Hadoop: Presentation Transcript

  • 1. Agile Analytics Applications (Russell Jurney)
  • 2. About me… Bearding.
    • I'm going to beat this guy
    • Seriously
    • Bearding is my #1 natural talent
    • Salty Sea Beard
    • Fortified with Pacific Ocean Minerals
  • 3. Agile Data: The Book (August 2013)
    • Read @ Safari Rough Cuts
    • A philosophy, not the only way
    • But still, it's good! Really!
  • 4. We go fast... but don't worry!
    • Download the slides - click the links - read examples!
    • If it's not on the blog, it's in the book!
    • Order now: http://shop.oreilly.com/product/0636920025054.do
    • Read the book on Safari Rough Cuts
  • 5. Agile Application Development: Check
    • LAMP stack mature
    • Post-Rails frameworks to choose from
    • Enable rapid feedback and agility
    • + NoSQL
  • 6. Data Warehousing
  • 7. Scientific Computing / HPC
    • 'Smart kid' only: MPI, Globus, etc. until Hadoop
    • Tubes and Mercury (old school), Cores and Spindles (new school)
    • UNIVAC and Deep Blue both fill a warehouse. We're back...
  • 8. Data Science? 33% Application Development, 33% Data Warehousing, 33% Scientific Computing / HPC
  • 9. Data Center as Computer
    • Warehouse Scale Computers and applications
    • "A key challenge for architects of WSCs is to smooth out these discrepancies in a cost-efficient manner." Click here for a paper on operating a 'data center as computer.'
  • 10. Hadoop to the Rescue!
    • Easy to use! (Pig, Hive, Cascading)
    • CHEAP: 1% the cost of SAN/NAS
    • A department can afford its own Hadoop cluster!
    • Dump all your data in one place: Hadoop DFS
    • Silos come CRASHING DOWN!
    • JOIN like crazy!
    • ETL like whoah!
    • An army of mappers and reducers at your command
    • OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
  • 11. NOW WHAT?
  • 12. Analytics Apps: It Takes a Team
    • Broad skill-set
    • Nobody has them all
    • Inherently collaborative
  • 13. Data Science Team
    • 3-4 team members with broad, diverse skill-sets that overlap
    • Transactional overhead dominates at 5+ people
    • Expert researchers: lend 25-50% of their time to teams
    • Creative workers. Run like a studio, not an assembly line
    • Total freedom... with goals and deliverables
    • Work environment matters most
  • 14. How to get insight into product?
    • Back-end has gotten THICKER
    • Generating $$$ insight can take 10-100x app dev
    • Timeline disjoint: analytics vs agile app-dev/design
    • How do you ship insights efficiently?
    • How do you collaborate on research vs developer timelines?
  • 15. The Wrong Way - Part One: "We made a great design. Your job is to predict the future for it."
  • 16. The Wrong Way - Part Two: "What is taking you so long to reliably predict the future?"
  • 17. The Wrong Way - Part Three: "The users don't understand what 86% true means."
  • 18. The Wrong Way - Part Four: GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
  • 19. The Wrong Way - Inevitable Conclusion: plane, meet mountain.
  • 20. Reminds me of... the waterfall model :(
  • 21. Chief Problem
    • You can't design insight in analytics applications. You discover it. You discover by exploring.
  • 22. -> Strategy
    • So make an app for exploring your data.
    • Which becomes a palette for what you ship.
    • Iterate and publish intermediate results.
  • 23. Data Design
    • It's not the 1st query that = insight; it's the 15th, or the 150th
    • Capturing "Ah ha!" moments
    • Slow to do those in batch…
    • Faster, better context in an interactive web application
    • Pre-designed charts wind up terrible. So bad.
    • Easy to invest man-years in the wrong statistical models
    • Semantics of presenting predictions are complex, delicate
    • Opportunity lies at the intersection of data & design
  • 24. How do we get back to Agile?
  • 25. Statement of Principles (then tricks, with code)
  • 26. Set up an environment where...
    • Insights are repeatedly produced
    • Iterative work is shared with the entire team
    • Interactive from day zero
    • Data model is consistent end-to-end
    • Minimal impedance between layers
    • Scope and depth of insights grow
    • Insights form the palette for what you ship
    • Until the application pays for itself and more
  • 27. Snowballing Audience
  • 28. Value document > relation
    • Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
  • 29. Value document > relation
    • Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.
  • 30. Relational Data = Legacy Format
    • Why JOIN? Storage is fundamentally cheap!
    • Duplicate that JOIN data in one big record type! (see the sketch below)
    • ETL once to document format on import, NOT every job
    • Not zero JOINs, but far fewer JOINs
    • Semi-structured documents preserve data's actual structure
    • Column-compressed document formats beat JOINs!
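    A minimal sketch of that denormalization in Python, assuming the message and recipient tables have already been read into lists of dicts (field names mirror the Enron ETL shown later and are otherwise illustrative):

    from collections import defaultdict

    def denormalize(messages, recipients):
        """Fold recipient rows into their message: one nested document per email."""
        by_message = defaultdict(lambda: {"tos": [], "ccs": [], "bccs": []})
        for r in recipients:
            key = {"to": "tos", "cc": "ccs", "bcc": "bccs"}[r["reciptype"]]
            by_message[r["message_id"]][key].append(
                {"address": r["address"], "name": r["name"]}
            )
        documents = []
        for m in messages:
            doc = dict(m)                            # copy the flat message row
            doc.update(by_message[m["message_id"]])  # attach its recipient lists
            documents.append(doc)                    # one self-contained document
        return documents

    Do this once on import and every later job reads whole emails instead of re-JOINing three tables.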
  • 31. Value imperative > declarative
    • We don't know what we want to SELECT.
    • Data is dirty - check each step, clean iteratively.
    • 85% of a data scientist's time is spent munging. See: ETL.
    • Imperative is optimized for our process.
    • Process = iterative, snowballing insight
    • Efficiency matters, self optimize
  • 32. Value dataflow > SELECT
  • 33. Ex. dataflow: ETL + email sent count
    • (I can't read this either. Get a big version here.)
  • 34. Value Pig > Hive (for app-dev)
    • Pigs eat ANYTHING
    • Pig is optimized for refining data, as opposed to consuming it
    • Pig is imperative, iterative
    • Pig is dataflows, and SQLish (but not SQL)
    • Code modularization/re-use: Pig Macros
    • ILLUSTRATE speeds dev time (even UDFs)
    • Easy UDFs in Java, JRuby, Jython, Javascript (see the Jython sketch below)
    • Pig Streaming = use any tool, period.
    • Easily prepare our data as it will appear in our app.
    • If you prefer Hive, use Hive.
    But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive…
    See: HCatalog for Pig/Hive integration.
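    As an illustration of how small a Pig UDF can be, here is a hedged sketch of a trivial Jython UDF, assuming Pig 0.9+ with Jython UDF support; the file name and function are made up, and it would be registered from Pig with REGISTER 'udfs.py' USING jython AS myudfs;

    # udfs.py - a trivial Pig/Jython UDF (illustrative)
    from pig_util import outputSchema

    @outputSchema('address:chararray')
    def normalize_address(address):
        """Lowercase and trim an email address so grouping keys line up."""
        if address is None:
            return None
        return address.strip().lower()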
  • 35. Localhost vs Petabyte scale: same tools
    • Simplicity is essential to scalability: use the highest-level tools we can
    • Prepare a good sample - tricky with joins, easy with documents (one way is sketched below)
    • Local mode: pig -l /tmp -x local -v -w
    • Frequent use of ILLUSTRATE
    • 1st: Iterate, debug & publish locally
    • 2nd: Run on cluster, publish to team/customer
    • Consider skipping Object-Relational Mapping (ORM)
    • We do not trust 'databases', only HDFS @ n=3.
    • Everything we serve in our app is re-creatable via Hadoop.
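    One way to cut that local sample from document-serialized events, sketched in Python with the avro package (paths and sample size are placeholders): reservoir-sample a few thousand records and dump them as JSON lines for local development.

    import json
    import random
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    def sample_avro(path, n=1000, seed=42):
        """Reservoir-sample n records from an Avro file for local development."""
        random.seed(seed)
        sample = []
        with open(path, 'rb') as f:
            reader = DataFileReader(f, DatumReader())
            for i, record in enumerate(reader):
                if i < n:
                    sample.append(record)
                else:
                    j = random.randint(0, i)
                    if j < n:
                        sample[j] = record
            reader.close()
        return sample

    with open('/tmp/emails_sample.json', 'w') as out:
        for rec in sample_avro('/enron/emails.avro'):
            out.write(json.dumps(rec) + '\n')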
  • 36. Data-Value Pyramid
    • Climb it. Do not skip steps. See here.
  • 37. 0/1) Display atomic records on the web
  • 38. 0.0) Document-serialize events
    • Protobuf
    • Thrift
    • JSON
    • Avro - I use Avro because the schema is onboard.
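    A minimal Avro write in Python, assuming the avro package; the schema is trimmed for illustration (the real one carries from/tos/ccs/bccs records), and the output path is a placeholder.

    import json
    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # Trimmed-down email schema; note the schema travels inside the Avro file.
    schema = avro.schema.parse(json.dumps({
        "type": "record", "name": "Email",
        "fields": [
            {"name": "message_id", "type": "string"},
            {"name": "body", "type": ["string", "null"]},
        ],
    }))  # the Python 3 avro package spells this avro.schema.Parse

    with open("/tmp/emails.avro", "wb") as f:
        writer = DataFileWriter(f, DatumWriter(), schema)
        writer.append({
            "message_id": "<1731.10095812390082.JavaMail.evans@thyme>",
            "body": "scamming people, blah blah",
        })
        writer.close()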
  • 39. 0.1) Documents via Relation ETL

    enron_messages = load '/enron/enron_messages.tsv' as (
        message_id:chararray,
        sql_date:chararray,
        from_address:chararray,
        from_name:chararray,
        subject:chararray,
        body:chararray
    );

    enron_recipients = load '/enron/enron_recipients.tsv' as (
        message_id:chararray,
        reciptype:chararray,
        address:chararray,
        name:chararray
    );

    split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';

    headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
    with_headers = join headers by group, enron_messages by message_id parallel 10;

    emails = foreach with_headers generate
        enron_messages::message_id as message_id,
        CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
        TOTUPLE(enron_messages::from_address, enron_messages::from_name)
            as from:tuple(address:chararray, name:chararray),
        enron_messages::subject as subject,
        enron_messages::body as body,
        headers::tos.(address, name) as tos,
        headers::ccs.(address, name) as ccs,
        headers::bccs.(address, name) as bccs;

    store emails into '/enron/emails.avro' using AvroStorage();

    Example here.
  • 40. 0.2) Serialize events from streams

    class GmailSlurper(object):
        ...
        def init_imap(self, username, password):
            self.username = username
            self.password = password
            try:
                imap.shutdown()
            except:
                pass
            self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
            self.imap.login(username, password)
            self.imap.is_readonly = True
        ...
        def write(self, record):
            self.avro_writer.append(record)
        ...
        def slurp(self):
            if(self.imap and self.imap_folder):
                for email_id in self.id_list:
                    (status, email_hash, charset) = self.fetch_email(email_id)
                    if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                        print email_id, charset, email_hash['thread_id']
                        self.write(email_hash)

    Scrape your own gmail in Python and Ruby.
  • 41. 0.3) ETL Logs

    log_data = LOAD 'access_log'
        USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
        AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);

  • 42. 1) Plumb atomic events -> browser
    • (Example stack that enables high productivity)
  • 43. 1.1) cat our Avro serialized events

    me$ cat_avro ~/Data/enron.avro
    {
      u'bccs': [],
      u'body': u'scamming people, blah blah',
      u'ccs': [],
      u'date': u'2000-08-28T01:50:00.000Z',
      u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
      u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
      u'subject': u'Re: Enron trade for frop futures',
      u'tos': [{u'address': u'connie@enron.com', u'name': None}]
    }

    Get cat_avro in python, ruby
  • 44. 1.2) Load our events in Pig

    me$ pig -l /tmp -x local -v -w
    grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
    grunt> describe enron_emails

    emails: {
      message_id: chararray,
      datetime: chararray,
      from: tuple(address: chararray, name: chararray),
      subject: chararray,
      body: chararray,
      tos: {to: (address: chararray, name: chararray)},
      ccs: {cc: (address: chararray, name: chararray)},
      bccs: {bcc: (address: chararray, name: chararray)}
    }
  • 45. 1.3) ILLUSTRATE our events in Pig

    grunt> illustrate enron_emails

    | emails |
    | message_id: chararray  | <1731.10095812390082.JavaMail.evans@thyme> |
    | datetime: chararray    | 2001-01-09T06:38:00.000Z |
    | from: tuple(address: chararray, name: chararray) | (bob.dobbs@enron.com, J.R. Bob Dobbs) |
    | subject: chararray     | Re: Enron trade for frop futures |
    | body: chararray        | scamming people, blah blah |
    | tos: bag{to: tuple(address: chararray, name: chararray)}   | {(connie@enron.com,)} |
    | ccs: bag{cc: tuple(address: chararray, name: chararray)}   | {} |
    | bccs: bag{bcc: tuple(address: chararray, name: chararray)} | {} |

    Upgrade to Pig 0.10+
  • 46. 1.4) Publish our events to a 'database'

    From Avro to MongoDB in one command:

    pig -l /tmp -x local -v -w -param avros=enron.avro \
        -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

    Which does this:

    /* MongoDB libraries and configuration */
    register /me/mongo-hadoop/mongo-2.7.3.jar
    register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
    register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

    /* Set speculative execution off to avoid chance of duplicate records in Mongo */
    set mapred.map.tasks.speculative.execution false
    set mapred.reduce.tasks.speculative.execution false

    define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

    /* By default, let's have 5 reducers */
    set default_parallel 5

    avros = load '$avros' using AvroStorage();
    store avros into '$mongourl' using MongoStorage();

    Full instructions here.
  • 47. 1.5) Check events in our 'database'

    $ mongo enron
    MongoDB shell version: 2.0.2
    connecting to: enron
    > show collections
    emails
    system.indexes
    > db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
    {
      "_id" : ObjectId("502b4ae703643a6a49c8d180"),
      "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
      "date" : "2001-01-09T06:38:00.000Z",
      "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
      "subject" : "Re: Enron trade for frop futures",
      "body" : "Scamming more people...",
      "tos" : [ { "address" : "connie@enron", "name" : null } ],
      "ccs" : [ ],
      "bccs" : [ ]
    }
  • 48. 1.6) Publish events on the web

    require 'rubygems'
    require 'sinatra'
    require 'mongo'
    require 'json'

    connection = Mongo::Connection.new
    database = connection['agile_data']
    collection = database['emails']

    get '/email/:message_id' do |message_id|
      data = collection.find_one({:message_id => message_id})
      JSON.generate(data)
    end

  • 49. 1.6) Publish events on the web
  • 50. One-Liner to Transition Stack
  • 51. What's the point?
    • A designer can work against real data.
    • An application developer can work against real data.
    • A product manager can think in terms of real data.
    • Entire team is grounded in reality!
    • You'll see how ugly your data really is.
    • You'll see how much work you have yet to do.
    • Ship early and often!
    • Feels agile, don't it? Keep it up!
  • 52. 1.7) Wrap events with Bootstrap

    <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
    </head>
    <body>
      <div class="container" style="margin-top: 100px;">
        <table class="table table-striped table-bordered table-condensed">
          <thead>
            {% for key in data['keys'] %}
              <th>{{ key }}</th>
            {% endfor %}
          </thead>
          <tbody>
            <tr>
              {% for value in data['values'] %}
                <td>{{ value }}</td>
              {% endfor %}
            </tr>
          </tbody>
        </table>
      </div>
    </body>

    Complete example here with code here.

  • 53. 1.7) Wrap events with Bootstrap
  • 54. Refine. Add links between documents.
    • Not the Mona Lisa, but coming along... See here.
  • 55. 1.8) List links to sorted events

    Use your 'database', if it can sort:

    mongo enron
    > db.emails.ensureIndex({message_id: 1})
    > db.emails.find().sort({date: 0}).limit(10).pretty()
    {
      "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
      "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
      "from" : [
    ...

    Or use Pig, and serve/cache a bag/array of email documents:

    pig -l /tmp -x local -v -w
    emails_per_user = foreach (group emails by from.address) {
      sorted = order emails by date;
      last_1000 = limit sorted 1000;
      generate group as from_address, emails as emails;
    };
    store emails_per_user into '$mongourl' using MongoStorage();

  • 56. 1.8) List links to sorted documents
  • 57. 1.9) Make it searchable...

    If you have a list, search is easy with ElasticSearch and Wonderdog...

    /* Load ElasticSearch integration */
    register /me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar;
    register /me/elasticsearch-0.18.6/lib/*;
    define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

    emails = load '/me/tmp/emails' using AvroStorage();
    store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

    Test it with curl:

    curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

    ElasticSearch has no security features. Take note. Isolate.
  • 58. 2) Create Simple Charts
  • 59. 2) Create Simple Tables and Charts
  • 60. 2) Create Simple Charts
    • Start with an HTML table on general principle.
    • Then use nvd3.js - reusable charts for d3.js
    • Aggregating by properties & displaying them is the first step in entity resolution
    • Start extracting entities. Ex: people, places, topics, time series
    • Group documents by entities, rank and count.
    • Publish top N, time series, etc.
    • Fill a page with charts. (A sketch of the serving side follows below.)
    • Add a chart to your event page.
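    A sketch of that serving side in Python with Flask and pymongo; the slides use Sinatra, so this is an equivalent, not the presenter's code, and the database, collection, and route names are assumptions. The idea is that the pre-aggregated top-N records written by the Pig job on the next slide come straight back out as JSON that an nvd3/d3.js chart can consume.

    import json
    from flask import Flask
    from pymongo import MongoClient

    app = Flask(__name__)
    collection = MongoClient()['agile_data']['top_things']  # written by the Pig top-N job

    @app.route('/chart/top_things/<key>')
    def top_things(key):
        """Return the cached top-10 list for one key as chart-ready JSON."""
        record = collection.find_one({'key': key}, {'_id': False})
        return json.dumps(record or {})

    if __name__ == '__main__':
        app.run(debug=True)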
  • 61. 2.1) Top N (of anything) in Pig

    pig -l /tmp -x local -v -w

    top_things = foreach (group things by key) {
      sorted = order things by arbitrary_rank desc;
      top_10_things = limit sorted 10;
      generate group as key, top_10_things as top_10_things;
    };
    store top_things into '$mongourl' using MongoStorage();

    Remember, this is the same structure the browser gets as json.
    This would make a good Pig Macro.
  • 62. 2.2) Time Series (of anything) in Pig

    pig -l /tmp -x local -v -w

    /* Group by our key and date rounded to the month, get a total */
    things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
      generate flatten(group) as (key, month),
               COUNT_STAR(things) as total;

    /* Sort our totals per key by month to get a time series */
    things_timeseries = foreach (group things_by_month by key) {
      timeseries = order things_by_month by month;
      generate group as key, timeseries as timeseries;
    };
    store things_timeseries into '$mongourl' using MongoStorage();

    Yet another good Pig Macro.
  • 63. Data processing in our stack
    • A new feature in our application might begin at any layer... great!
    • Any team member can add new features, no problemo!
    • "I'm creative! I know Pig!" "I'm creative too! I <3 Javascript!" "omghi2u! where r my legs? send halp"
  • 64. Data processing in our stack
    • ... but we shift the data-processing towards batch, as we are able.
    • Ex: Overall total emails calculated in each layer
    • See real example here.
  • 65. 3) Exploring with Reports
  • 66. 3) Exploring with Reports
  • 67. 3.0) From charts to reports...
    • Extract entities from the properties we aggregated by in charts (Step 2)
    • Each entity type gets its own kind of web page
    • Each unique entity gets its own web page (a route sketch follows below)
    • Link to entities as they appear in atomic event documents (Step 1)
    • Link the most related entities together, within and between types.
    • More visualizations!
    • Parameterize results via forms.
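    A hedged sketch of one such entity page, again in Python with Flask and pymongo (the collection names, route, and address.html template are illustrative, not part of the original stack): each email address gets a report page assembled from the cached aggregates the Pig jobs already wrote.

    from flask import Flask, render_template
    from pymongo import MongoClient

    app = Flask(__name__)
    db = MongoClient()['agile_data']

    @app.route('/address/<address>')
    def address_report(address):
        """One report page per email address: its emails plus pre-computed charts."""
        emails = db['emails_per_user'].find_one({'from_address': address})
        timeseries = db['things_timeseries'].find_one({'key': address})
        return render_template('address.html', emails=emails, timeseries=timeseries)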
  • 68. 3.1) Looks like this...
  • 69. 3.2) Cultivate common keyspaces
  • 70. 3.3) Get people clicking. Learn.
    • Explore this web of generated pages, charts and links!
    • Everyone on the team gets to know your data.
    • Keep trying out different charts, metrics, entities, links.
    • See what's interesting.
    • Figure out what data needs cleaning and clean it.
    • Start thinking about predictions & recommendations.
    • 'People' could be just your team, if the data is sensitive.
  • 71. 4) Predictions and Recommendations
  • 72. 4.0) Preparation
    • We've already extracted entities, their properties and relationships
    • Our charts show where our signal is rich
    • We've cleaned our data to make it presentable
    • The entire team has an intuitive understanding of the data
    • They got that understanding by exploring the data
    • We are all on the same page!
  • 73. 4.2) Think in different perspectives
    • Networks
    • Time Series / Distributions
    • Natural Language Processing
    • Conditional Probabilities / Bayesian Inference
    • Check out Chapter 2 of the book
  • 74. 4.3) Networks
  • 75. 4.3.1) Weighted Email Networks in Pig

    DEFINE header_pairs(email, col1, col2) RETURNS pairs {
      filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
      flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
      $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
    }

    /* Get email address pairs for each type of connection, and union them together */
    emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
    from_to = header_pairs(emails, from, to);
    from_cc = header_pairs(emails, from, cc);
    from_bcc = header_pairs(emails, from, bcc);
    pairs = UNION from_to, from_cc, from_bcc;

    /* Get a count of emails over these edges. */
    pair_groups = GROUP pairs BY (ego1, ego2);
    sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
  • 76. 4.3.2) Networks Viz with Gephi
  • 77. 4.3.3) Gephi = Easy
  • 78. 4.3.4) Social Network Analysis
  • 79. 4.4) Time Series & Distributions

    pig -l /tmp -x local -v -w

    /* Count things per day */
    things_per_day = foreach (group things by (key, ISOToDay(datetime)))
      generate flatten(group) as (key, day),
               COUNT_STAR(things) as total;

    /* Sort our totals per key by day to get a sorted time series */
    things_timeseries = foreach (group things_per_day by key) {
      timeseries = order things_per_day by day;
      generate group as key, timeseries as timeseries;
    };
    store things_timeseries into '$mongourl' using MongoStorage();
  • 80. 4.4.1) Smooth Sparse Data
    • See here.
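    Sparse daily counts read better once the missing days are filled in and smoothed. A pandas sketch of one way to do it, assuming records shaped like the per-day totals above (the 'day' and 'total' field names are assumptions):

    import pandas as pd

    def smooth_daily_counts(records, window=7):
        """Fill missing days with zero and apply a rolling mean to a sparse series."""
        s = pd.Series(
            {pd.Timestamp(r['day']): r['total'] for r in records}
        ).sort_index()
        full = s.reindex(pd.date_range(s.index.min(), s.index.max(), freq='D'),
                         fill_value=0)
        return full.rolling(window=window, min_periods=1).mean()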
  • 81. 4.4.2) Regress to find Trends
    • JRuby Linear Regression UDF
    • Pig to use the UDF
    • Trend Line in your Application
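    The same trend-line idea in plain Python/NumPy, as an alternative to the JRuby UDF (a least-squares fit over day index vs. count; it would take the smoothed series from the previous sketch):

    import numpy as np

    def trend_line(values):
        """Fit y = m*x + b over the series and return the fitted trend points."""
        x = np.arange(len(values), dtype=float)
        m, b = np.polyfit(x, np.asarray(values, dtype=float), 1)
        return m * x + b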
  • 82. 4.5.1) Natural Language Processing

    import 'tfidf.macro';
    my_tf_idf_scores = tf_idf(id_body, message_id, body);

    /* Get the top 10 Tf*Idf scores per message */
    per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
      sorted = order my_tf_idf_scores by value desc;
      top_10_topics = limit sorted 10;
      generate group, top_10_topics.(score, value);
    }

    Example with code here and macro here.
  • 83. 4.5.2) NLP: Extract Topics!
  • 84. 4.5.3) NLP for All: Extract Topics!
    • TF-IDF in Pig - 2 lines of code with Pig Macros:
      http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
    • LDA with Pig and the Lucene Tokenizer:
      http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
  • 85. 4.6) Probability & Bayesian Inference
  • 86. 4.6.1) Gmail Suggested Recipients
  • 87. 4.6.1) Reproducing it with Pig...
  • 88. 4.6.2) Step 1: COUNT(From -> To)
  • 89. 4.6.2) Step 2: COUNT(From, To, Cc) / Total
    • P(cc | to) = the probability of cc'ing someone, given that you've to'd someone
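    The arithmetic behind that suggestion, sketched in Python over the email documents (in practice the counts come from the Pig jobs in steps 1 and 2; the function and field names are illustrative and follow the schema shown in slide 44):

    from collections import Counter

    def p_cc_given_to(emails):
        """P(cc=B | to=A): how often B is cc'd on emails where A is a to recipient."""
        to_counts = Counter()      # times each address was to'd
        pair_counts = Counter()    # times a (to, cc) pair appeared together
        for email in emails:
            for to in email.get('tos', []):
                to_counts[to['address']] += 1
                for cc in email.get('ccs', []):
                    pair_counts[(to['address'], cc['address'])] += 1
        return {pair: count / float(to_counts[pair[0]])
                for pair, count in pair_counts.items()}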
  • 90. 4.6.3) Wait - Stop Here! It works!
    • They match…
  • 91. 4.4) Add predictions to reports
  • 92. 5) Enable new actions
  • 93. Why doesn’t Kate reply to my emails?• What time is best to catch her?• Are they too long?• Are they meant to be replied to (contain original content)?• Are they nice? (sentiment analysis)• Do I reply to her emails (reciprocity)?• Do I cc the wrong people (my mom)?93Wednesday, May 8, 13
  • 94. Example: Packetpig and PacketLoop

    snort_alerts = LOAD '$pcap'
      USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

    countries = FOREACH snort_alerts
      GENERATE
        com.packetloop.packetpig.udf.geoip.Country(src) as country,
        priority;

    countries = GROUP countries BY country;

    countries = FOREACH countries
      GENERATE
        group,
        AVG(countries.priority) as average_severity;

    STORE countries into 'output/choropleth_countries' using PigStorage(',');

    Code here.
  • 95. Example: Packetpig and PacketLoop
  • 96. Thank You!
    • Questions & Answers
    • Follow: @rjurney
    • Read the Blog: datasyndrome.com