TO INFINITY AND BEYOND
Pranav Prakash
in.linkedin.com/in/prakashpranav
Search @LinkedIn
Hari Prasanna
in.linkedin.com/in/mostlycached
BigData @LinkedIn
The story of how solving one problem the OpenSource way
opened doors to so much more
OpenSource Chain Reaction
How “it” begins
How “it” grows
How “it” contributes
LUCENE
Information Retrieval Library
Started in 1999 as a SourceForge.net project
Joined Apache in 2001 as part of the Jakarta family
Became an Apache Top-Level Project in 2005
LinkedIn, Twitter, Comcast
LUCENE
IR requirements
What would you do next?
Be better at searching
Crawl the web
Web Wrapper around Lucene
Full Text Search, NRT Indexing
Faceted Search, Clustering
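Full-text search rests on an inverted index: a map from each term to the documents that contain it. The sketch below is a toy illustration of that idea in plain Java. It does not use or mirror the Lucene API, and the class name, tokenizer and AND-only query semantics are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents containing it.
// Illustration only; real libraries add analysis, scoring, segments and more.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document: lowercase it, split on non-alphanumerics, record docId per term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of documents that contain every term in the query (AND semantics).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) continue;
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the candidate set
            } else {
                result.retainAll(docs);         // later terms narrow it down
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is an information retrieval library");
        index.add(2, "Nutch is a web crawler built on Lucene");
        System.out.println(index.search("lucene"));         // [1, 2]
        System.out.println(index.search("lucene crawler")); // [2]
    }
}
```

Looking up a term is a single map access, which is why the same structure scales from a library to a search server once it is persisted and sharded.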
NUTCH
Web Crawler
Billions of pages on the internet
Alternative to commercial search engines
From a single tool to an ecosystem
• Breaking away from the initial problem statement
• The Google factor - GFS (2003), Bigtable (2006), Pregel (2009), leading to
HDFS, HBase and Giraph
• The thrill and chaos of working with alpha software - from dealing with
compatibility issues to being a part of active development
• Interoperability between various systems
• Ever-widening scope of the project and leveraging other tools in the
ecosystem
Ecosystem
• Features:
• Distributed storage - HDFS
• Distributed processing - MapReduce
• Fault tolerance
• Horizontal scalability
• Comparisons
• RDBMS
• Grid computing
• Use Cases
• Analytics (trends, predictions, summaries, etc.)
• Searching and Indexing
Hadoop
• Features:
• Column based storage
• Horizontal scalability
• Low latency reads
• MapReduce support
• SQL support with Phoenix
• Coprocessors and secondary indexes
• RDBMS vs HBase
• Use cases
• Facebook Messages
• Monitoring with OpenTSDB
HBase
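The column-based storage and low-latency reads listed above follow from a simple data model: a sparse, sorted map from row key to columns. The sketch below models that in memory with plain Java collections. It is not the HBase client API, and the row keys and "family:qualifier" column names are hypothetical.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of a wide-column table: row key -> (column -> value), rows sorted by key.
// An in-memory illustration only; no persistence, versions or regions.
class WideColumnSketch {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Write one cell; rows and columns are created on demand, so the table stays sparse.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Point read of a single cell; returns null when the row or column is absent.
    String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Range scan over row keys in [startRow, stopRow); efficient because rows are sorted.
    SortedMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow);
    }

    public static void main(String[] args) {
        WideColumnSketch table = new WideColumnSketch();
        // Hypothetical keys: prefixing with the user id keeps one user's rows adjacent.
        table.put("user1#msg001", "msg:body", "hello");
        table.put("user1#msg002", "msg:body", "world");
        table.put("user2#msg001", "msg:body", "hi");
        System.out.println(table.get("user1#msg002", "msg:body"));   // world
        System.out.println(table.scan("user1#", "user1$").keySet()); // [user1#msg001, user1#msg002]
    }
}
```

Row key design matters more here than in an RDBMS: with no joins and, by default, no secondary indexes, the key prefix decides which reads are cheap range scans.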
Vanilla MapReduce
Higher Abstractions
• Pig - data flow language
• Hive - SQL to MapReduce adapter
• Cascading - Pipeline primitives and other powerful abstractions
• Even higher abstractions with Cascalog (Cascading + Datalog-style queries in Clojure), PigPen (Clojure for Pig) and Pig libraries like
DataFu
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to
run the job. The map function is represented by the Mapper class, which declares an
abstract map() method. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
Figure 2-1. MapReduce logical data flow
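The excerpt above is cut off after its first import, so the original Mapper is not reproduced here. As a stand-in, the following is a minimal plain-Java sketch of the same map, shuffle and reduce flow for the maximum temperature example. It uses no Hadoop classes, and the "year,temperature" record format and all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> shuffle -> reduce data flow; no Hadoop classes involved.
class MaxTemperatureSketch {

    // Map step: turn one input line ("year,temperature") into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Shuffle step: group all emitted values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse the values for one key to their maximum.
    static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Driver: the "code to run the job", wiring the three steps together.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1950,0", "1950,22", "1949,111", "1949,78", "1950,-11");
        System.out.println(run(lines)); // {1949=111, 1950=22}
    }
}
```

In a real job the framework performs the shuffle and runs many map and reduce calls in parallel across the cluster; only the two functions and the driver are written by hand.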
Data Processing
• Data collection, aggregation and forwarding with
Kafka, Flume, Scribe.
• Real-time stream processing with Storm to enable
online machine learning and real-time analytics at
Twitter and Groupon.
• Graph processing at the scale of a trillion edges at Facebook with
Apache Giraph
• Quickstarting with the Cloudera distribution
• Getting one step through the door - SlideShare’s journey
• Can your app survive without it? - Raising your bar
• Programmer, Administrator, DBA, Data Scientist - what
hat are you wearing today?
• The road ahead
• Keeping track of the developments and giving back
Leveraging “Big Data”
• Scientific Research - SciHadoop, decoding DNA
• Finance - Fraud Detection, Algorithmic trading, Risk
Management
• Web - Network Analysis, Recommendation Engines,
Personalization
• Government - Election campaigns, intelligence
systems
• Supply chain optimization, Weather forecasting
In the Wild
To Infinity and Beyond - OSDConf2014

 

To Infinity and Beyond - OSDConf2014

  • 1. TO INFINITY AND BEYOND Pranav Prakash in.linkedin.com/in/prakashpranav Search @LinkedIn Hari Prasanna in.linkedin.com/in/mostlycached BigData @LinkedIn The story of how solving one problem the OpenSource way opened doors to so much more
  • 3. OpenSource Chain Reaction How “it” begins How “it” grows
  • 4. OpenSource Chain Reaction How “it” begins How “it” grows How “it” contributes
  • 9. LUCENE: Information Retrieval Library. Started in 1999 as a SourceForge.net project. Joined Apache in 2001 as part of the Jakarta family. Became a Top Level Project in 2005. Used at LinkedIn, Twitter, Comcast.
  • 10. LUCENE IR requirements What would you do next? Be better at searching Crawl the web
  • 11. Web Wrapper around Lucene Full Text Search, NRT Indexing Faceted Search, Clustering
  • 12. NUTCH Web Crawler Billions of pages on the internet Alternate to commercial engines
  • 13. From a single tool to an ecosystem • Breaking away from the initial problem statement • The Google factor - GFS (2003), BigTable (2006), Pregel (2009) leading to HDFS, HBase and Giraph • The thrill and chaos of working with alpha software - from dealing with compatibility issues to being a part of active development • Interoperability between various systems • Ever-widening scope of the project and leveraging other tools in the ecosystem
  • 15. Hadoop • Features: • Distributed storage - HDFS • Distributed processing - MapReduce • Fault tolerance • Horizontal scalability • Comparisons: • RDBMS • Grid computing • Use cases: • Analytics (trends, predictions, summaries, etc.) • Searching and indexing
  • 16. HBase • Features: • Column-based storage • Horizontal scalability • Low-latency reads • MapReduce support • SQL support with Phoenix • Coprocessors and secondary indexes • RDBMS vs HBase • Use cases: • Facebook messages • Monitoring with OpenTSDB
  • 17. Data Processing • Vanilla MapReduce • Higher abstractions: • Pig - data flow language • Hive - SQL to MapReduce adapter • Cascading - pipeline primitives and other powerful abstractions • Even higher abstractions with Cascalog (Cascading + Datalog), PigPen (Clojure for Pig) and Pig libraries like DataFu • [Book excerpt shown on the slide] "Java MapReduce: Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method." • [Captions visible in the excerpt] Example 2-3. Mapper for maximum temperature example; Figure 2-1. MapReduce logical data flow
  • 18. • Data collection, aggregation and forwarding with Kafka, Flume and Scribe • Real-time stream processing with Storm to enable online machine learning and real-time analytics at Twitter and Groupon • Graph processing of a trillion edges at Facebook with Apache Giraph
  • 19. Leveraging “Big Data” • Quickstarting with the Cloudera distribution • Getting one step through the door - SlideShare’s journey • Can your app survive without it? - Raising your bar • Programmer, Administrator, DBA, Data Scientist - what hat are you wearing today? • The road ahead • Keeping track of the developments and giving back
  • 20. In the Wild • Scientific research - SciHadoop, decoding DNA • Finance - fraud detection, algorithmic trading, risk management • Web - network analysis, recommendation engines, personalization • Government - election campaigns, intelligence systems • Supply chain optimization, weather forecasting
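
Slide 17 refers to "vanilla MapReduce" and quotes a maximum-temperature example built from a map function, a reduce function, and some code to run the job. The sketch below imitates those three pieces in plain Python so the data flow is easy to follow. It is not the Hadoop API and not the code from the slide. The `year,temperature` record format, the function names and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def map_record(line):
    """Map step: parse one 'year,temperature' record (assumed format)
    and emit a (year, temperature) pair."""
    year, temp = line.strip().split(",")
    yield year, int(temp)


def reduce_max(year, temps):
    """Reduce step: collapse all temperatures seen for a year to the maximum."""
    return year, max(temps)


def run_job(lines):
    """Driver: run map, group the emitted pairs by key (the 'shuffle'),
    then run reduce once per key."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return dict(reduce_max(key, values) for key, values in sorted(groups.items()))


# Hypothetical sample records, not data from the talk.
SAMPLE = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]

if __name__ == "__main__":
    print(run_job(SAMPLE))
```

In a real cluster the grouping step is done by the framework across machines. The higher abstractions listed on the slide (Pig, Hive, Cascading) exist largely so that this map, shuffle and reduce plumbing does not have to be written by hand.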