
Flink bizarro

Ted Dunning at Apache Flink London Meetup

Published in: Technology
  • I was at the presentation, caught (as a German) in a Bermuda triangle of German earnestness, American Bizarro humour, and British local context. Was this a presentation about Flink? Well, not really. It was more a presentation about how to think when you treat the world as a stream. How would you build a web indexer? Here, the stream starts from a seed URL, and each link you find is added to the queue and lets you find more links, until you have processed the entire Internet. So the fetcher needs some criteria for skipping or stopping. The Q&A session in Bizarro World becomes an A&Q session, so Stephan answered that he would put all operators in one Flink pipeline rather than splitting them into separate processes. Did Ted agree? To be honest, I was so confused that I don’t remember. But it was fun, for sure.
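
The crawl loop described in this comment (a seed URL, a queue that grows as links are found, and criteria for skipping or stopping) might be sketched roughly as follows. This is only an illustration, not code from the talk: the names `crawl`, `fetch_links`, `max_pages`, and `allowed_hosts`, and the toy `WEB` graph, are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse


def crawl(seed, fetch_links, max_pages=100, allowed_hosts=None):
    """Breadth-first crawl from `seed`; returns URLs in the order fetched.

    `fetch_links(url)` is a caller-supplied (hypothetical) function that
    returns the outgoing links of a page.
    """
    queue = deque([seed])
    seen = {seed}
    fetched = []
    # Stopping criterion: the queue is empty or the page budget is spent.
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            link, _fragment = urldefrag(link)  # "#top" does not make a new page
            # Skipping criteria: already queued, or outside the allowed hosts.
            if link in seen:
                continue
            if allowed_hosts is not None and urlparse(link).hostname not in allowed_hosts:
                continue
            seen.add(link)
            queue.append(link)
    return fetched


# A toy link graph standing in for the real web (assumed data).
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/", "http://a.example/2#top"],
    "http://a.example/2": [],
    "http://b.example/": ["http://b.example/x"],
}


def fake_fetch_links(url):
    return WEB.get(url, [])


if __name__ == "__main__":
    print(crawl("http://a.example/", fake_fetch_links, allowed_hosts={"a.example"}))
```

In a streaming design the queue would be a topic rather than an in-memory deque, but the skip and stop decisions stay the same.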

Flink bizarro

  1. © 2014 MapR Technologies
  2. Bizarro Flink
     • What is Bizarro
     • What features you lose with Bizarro Flink
     • How performance suffers with Bizarro Flink
     • How it takes longer to develop with Bizarro Flink
     • Sample Application
  3. Bizarro: Summary
     • Bizarro World, also known as thraE
     • Everything is inverted on thraE
       – “Us do opposite of all Earthly things!”
     • Good is bad. Beauty is ugly.
     • Thus, Bizarro Flink
  4. The Goal
     • (Re)write a web spider
       – Fetch pages
       – Record time, status, all headers, page content
       – Extract plain text and outgoing links
       – Propagate incoming links to target pages with anchor text
     • Be fashionable
       – Use streaming (MapR Streams, in fact)
     • Go Bizarro
       – Don’t use Flink
  5. [Diagram: pipeline overview] Workers: fetch, extractor, link. Streams: new urls { url }; content { url, status, headers, moved_to, content }; new_links; metrics; exceptions. Table: pages { url, status, headers, moved_to, body, out_links, in_links }.
  6. Table Contents
     • Columns: url, status, moved_to, request_headers, response_headers, body, text, out_links, in_links
     • Column families: Extraction, Link
  7. Fetch Worker [diagram]: fetch, with labels new urls { url }, pages, metrics, exceptions, and a record { url, status, headers, moved_to, content }.
  8. Extraction Worker [diagram]: extractor, with labels content, pages, new urls { url }, new_links, metrics, exceptions.
  9. Link Worker [diagram]: link, with labels content and new_links.
  10. Where Does This Leave Us?
     • A good streaming foundation takes us quite far
     • No worries (to speak of) about lack of multi-row transactions
     • Coding style reminiscent of Unix stdin/stderr programming
       – Mocked tests very easy to write
     • Crash tolerance/scaling/restart characteristics are very good
     • Performance is well off what it could be, largely due to serialization
  11. Where Does This Leave Flink?
     • The base level of technology is much higher than before
       – Building this system years ago would take much more than a long afternoon
       – Testing would take even longer (if even possible)
     • But Flink still gives a big boost
       – Level of bulls*t code is much reduced
       – Data motion is often eliminated
       – Serialization is much faster (but amazingly not much simpler)
  12. Q&A
     • @mapr
     • maprtech
     • tdunning@maprtech.com
     • Engage with us! MapR, maprtech, mapr-technologies
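
The worker style from slides 7 to 10 (small workers that read one stream and write to others, much like Unix filters, and are easy to test with mocks) could look roughly like the sketch below. It is not the code from the talk. Plain Python lists stand in for MapR Streams topics and a dict stands in for the pages table. The data-flow directions (the extractor reads content and emits new urls and new_links, the link worker reads new_links and updates pages) and the message fields `source`, `target`, and `anchor` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkAndTextParser(HTMLParser):
    """Collects visible text and (href, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []          # each entry: [href, [anchor text pieces]]
        self._open_link = None   # the <a> element currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open_link = [href, []]

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self.links.append(self._open_link)
            self._open_link = None

    def handle_data(self, data):
        piece = data.strip()
        if piece:
            self.text_parts.append(piece)
            if self._open_link is not None:
                self._open_link[1].append(piece)


def extraction_worker(content_msgs, pages, new_urls, new_links, exceptions):
    """Reads content messages; writes text/out_links to `pages` and emits
    one new-url and one new-link message per outgoing link."""
    for msg in content_msgs:
        try:
            parser = _LinkAndTextParser()
            parser.feed(msg["content"])
            parser.close()
            out_links = []
            for href, anchor_parts in parser.links:
                target = urljoin(msg["url"], href)  # resolve relative links
                out_links.append(target)
                new_urls.append({"url": target})
                new_links.append({"source": msg["url"], "target": target,
                                  "anchor": " ".join(anchor_parts)})
            row = pages.setdefault(msg["url"], {})
            row["text"] = " ".join(parser.text_parts)
            row["out_links"] = out_links
        except Exception as exc:
            # Failures go to a side stream, stderr-style, and the loop goes on.
            exceptions.append({"url": msg.get("url"), "error": repr(exc)})


def link_worker(new_link_msgs, pages):
    """Propagates each link, with its anchor text, to the target page's row.
    Every message updates exactly one row."""
    for msg in new_link_msgs:
        row = pages.setdefault(msg["target"], {})
        row.setdefault("in_links", []).append(
            {"source": msg["source"], "anchor": msg["anchor"]})


if __name__ == "__main__":
    pages, new_urls, new_links, exceptions = {}, [], [], []
    content = [{"url": "http://a.example/dir/index.html", "status": 200,
                "content": '<p>Hello</p><a href="other.html">Other <b>page</b></a>'}]
    extraction_worker(content, pages, new_urls, new_links, exceptions)
    link_worker(new_links, pages)
    print(pages)
```

Because each worker only touches the collections passed to it, a test can hand it in-memory lists in place of real streams. Each link message also updates a single row, which may be one reason the lack of multi-row transactions matters little in this design.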
