To Infinity and Beyond - OSDConf2014


The story of how solving one problem the OpenSource way opened doors to so much more. Talk presented by Pranav Prakash and Hari Prasanna at OSDConf 2014, New Delhi.

Published in: Technology, Education

  1. TO INFINITY AND BEYOND. Pranav Prakash (Search @LinkedIn) and Hari Prasanna (BigData @LinkedIn). The story of how solving one problem the OpenSource way opened doors to so much more.
  2. OpenSource Chain Reaction: How “it” begins
  3. OpenSource Chain Reaction: How “it” begins. How “it” grows
  4. OpenSource Chain Reaction: How “it” begins. How “it” grows. How “it” contributes
  5. LUCENE: Information Retrieval Library. Started as a project in 1999; joined Apache in 2001 as part of the Jakarta family; became a Top Level Project in 2005. LinkedIn, Twitter, Comcast.
  6. LUCENE: IR requirements. What would you do next? Be better at searching; crawl the web.
  7. Web wrapper around Lucene: full-text search, NRT indexing, faceted search, clustering.
  8. NUTCH: Web crawler. Billions of pages on the internet. Alternative to commercial engines.
  9. From a single tool to an ecosystem
      • Breaking away from the initial problem statement
      • The Google factor - GFS (2003), BigTable (2006), Pregel (2009) leading to HDFS, HBase and Giraph
      • The thrill and chaos of working with alpha software - from dealing with compatibility issues to being a part of active development
      • Interoperability between various systems
      • Ever-widening scope of the project and leveraging other tools in the ecosystem
  10. Ecosystem
  11. Hadoop
      • Features: distributed storage (HDFS); distributed processing (MapReduce); fault tolerance; horizontal scalability
      • Comparisons: RDBMS; grid computing
      • Use cases: analytics (trends, predictions, summaries, etc.); searching and indexing
  12. HBase
      • Features: column-based storage; horizontal scalability; low-latency reads; MapReduce support; SQL support with Phoenix; coprocessors and secondary indexes
      • RDBMS vs HBase
      • Use cases: Facebook messages; monitoring with OpenTSDB
  13. Data Processing
      • Vanilla MapReduce
      • Higher abstractions:
        • Pig - data flow language
        • Hive - SQL to MapReduce adapter
        • Cascading - pipeline primitives and other powerful abstractions
        • Even higher abstractions with Cascalog (Cascading + Datalog), PigPen (Clojure for Pig) and Pig libraries like DataFu
      • Excerpt shown on the slide: “Java MapReduce. Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method.”
      • Captions visible in the excerpt: “Example 2-3. Mapper for maximum temperature example”; “Figure 2-1. MapReduce logical data flow”
  14. (untitled slide)
      • Data collection, aggregation and forwarding with Kafka, Flume and Scribe
      • Real-time stream processing with Storm, enabling online machine learning and real-time analytics at Twitter and Groupon
      • Graph processing of a trillion edges at Facebook with Apache Giraph
  15. Leveraging “Big Data”
      • Quickstarting with the Cloudera distribution
      • Getting one step through the door - SlideShare’s journey
      • Can your app survive without it? - Raising your bar
      • Programmer, Administrator, DBA, Data Scientist - what hat are you wearing today?
      • The road ahead
      • Keeping track of the developments and giving back
  16. In the Wild
      • Scientific research - SciHadoop, decoding DNA
      • Finance - fraud detection, algorithmic trading, risk management
      • Web - network analysis, recommendation engines, personalization
      • Government - election campaigns, intelligence systems
      • Supply chain optimization, weather forecasting
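
The Data Processing slide (13) quotes the three pieces a vanilla MapReduce job needs: a map function, a reduce function, and some code to run the job. As a rough illustration only, here is a minimal in-memory sketch of that shape in plain Python for the maximum-temperature example the slide mentions. It does not use the Hadoop API; the `year,temperature` record format, the sample values, and the names `map_record`, `reduce_year` and `run_job` are all assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical input: one "year,temperature" record per line.
# The format and the values are invented for this sketch.
RECORDS = [
    "1949,111",
    "1949,78",
    "1950,0",
    "1950,22",
    "1950,-11",
]


def map_record(line):
    """Map step: parse one record and emit a (year, temperature) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_year(year, temps):
    """Reduce step: collapse all temperatures seen for one year to their maximum."""
    return year, max(temps)


def run_job(records):
    """Driver: map every record, group values by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in records:
        for key, value in map_record(line):
            grouped[key].append(value)
    # Reduce each key's list of values; keys are processed in sorted order.
    return dict(reduce_year(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a real Hadoop job the grouping step is done by the framework across machines, and the map and reduce functions run in parallel on separate input splits; the sketch keeps everything in one process to show only the division of work.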