Get involved with the Apache Software Foundation

Presented at Indian Institute of Information Technology (IIIT) Allahabad on 21 Oct 2009 to students about the Apache Software Foundation, Lucene, Solr, Hadoop and on the benefits of contributing to open source projects. The target audience was sophomore, junior and senior B.Tech students.

Presentation Transcript

  • Get involved with the Apache Software Foundation Shalin Shekhar Mangar shalin [at] apache [dot] org
  • Who am I?
  • History
    • 1996 – A "patchy" web server
    • 1999 – The Apache Software Foundation, Tomcat, Lucene
    • 2002 – Nutch
    • 2006 – Solr, Hadoop
    • 2008 – Mahout
  • Today
    • Apache HTTPD powers 65% of all web servers and serves 100 million websites!
    • Lucene powers search on thousands of web sites
    • Hadoop powers AOL, Yahoo, Facebook. Runs on thousand-node clusters
    • So many projects!
    • Thousands of active contributors
  • Why work on Apache/OSS?
    • Work on what you like, when you like
    • Development in the "real" world
    • Learn from the best
    • Build a publicly verifiable resume
    • Companies will find you!
  • Problems we're solving
    • Fast full-text search
    • Application servers & frameworks
    • Processing petabytes of data on thousands of unreliable commodity servers
    • Crawling the web
    • Scalable machine learning algorithms
    • Data mining & analytics
  • Lucene
    • High performance, scalable, full-text search library
    • Focus: Indexing + Searching Documents
    • 100% Java, no dependencies
    • No crawlers or document parsing
    • Users: Wikipedia, Technorati, SourceForge, …
    • Applications: Eclipse, Jira, Nutch, Solr, many commercial products
  • Lucene Inverted Index
  • Lucene Components
    • Inverted Index
    • Write once – merge in the background
    • Query Types – Term, Boolean, Prefix, Range
    • Scoring – TF, IDF, Length, Constant, Function
    • Filtering
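
The components listed above can be illustrated with a toy inverted index. This is only a minimal sketch in Python, not Lucene's API or its actual scoring formula: the class `TinyIndex`, its method names, and the simplified TF × IDF / sqrt(length) score are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical, not Lucene): term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_lengths = {}

    def add(self, doc_id, text):
        # Tokenize naively on whitespace; real analyzers do far more.
        tokens = text.lower().split()
        self.doc_lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def term(self, term):
        # Term query: every document containing the exact term.
        return set(self.postings.get(term, {}))

    def boolean_and(self, *terms):
        # Boolean query: documents containing all of the terms.
        sets = [self.term(t) for t in terms]
        return set.intersection(*sets) if sets else set()

    def prefix(self, prefix):
        # Prefix query: documents containing any term starting with the prefix.
        hits = set()
        for term, docs in self.postings.items():
            if term.startswith(prefix):
                hits.update(docs)
        return hits

    def score(self, term, doc_id):
        # Simplified score (an assumption): TF * IDF, normalized by document length.
        tf = self.postings.get(term, {}).get(doc_id, 0)
        if tf == 0:
            return 0.0
        idf = math.log(len(self.doc_lengths) / len(self.postings[term])) + 1.0
        return tf * idf / math.sqrt(self.doc_lengths[doc_id])


index = TinyIndex()
index.add(1, "apache lucene search library")
index.add(2, "apache solr search server")
index.add(3, "apache hadoop")

print(sorted(index.boolean_and("apache", "search")))  # [1, 2]
print(index.score("lucene", 1) > index.score("apache", 1))  # True: rarer terms score higher
```

The lookup cost depends on the number of matching postings rather than on scanning every document, which is the point of inverting the index.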
  • Lucene – Towards the Future
    • Near real-time search – Many engineering challenges
    • Flexible indexing – Alternate file formats, data structures
    • Updates – Common values & per-document
    • Query Optimization
    • Better language support
  • Solr
    • Search server built on Lucene
    • Schema
    • HTTP APIs
    • Replication
    • Distributed Search
    • Caching
    • Extensible with plugins
  • Solr – Towards the Future
    • Near Real-Time Search & Replication
    • Scale to hundreds of servers
    • Scale to thousands of indexes on a single box
    • Update documents
    • Faster auto-complete component
    • Field Collapsing
    • Clustering, Spell Suggestions, Clickstream feedback
  • Hadoop
    • Distributed File System – HDFS
    • Map/Reduce
    • Job scheduler
    • Reliably store petabytes of data
    • Compute in parallel
    • Detect/handle failures
  • Map/Reduce
    • map(key1, value1) -> list<key2, value2>
    • reduce(key2, list<value2>) -> list<value3>
    • A large number of problems can be solved in this functional way
    • Sort, Word Count, PageRank, Deduplication
    • Data mining, co-occurrence analysis
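
The map and reduce signatures above can be sketched with Word Count, one of the problems named on the slide. This is a single-process simulation in Python, not the Hadoop API: the helper `run_map_reduce` and the function names are hypothetical, and the in-memory grouping only stands in for the shuffle step that a real cluster performs across machines.

```python
from collections import defaultdict


def map_word_count(key, value):
    """map(key1, value1) -> list of (key2, value2): emit (word, 1) for each word."""
    return [(word, 1) for word in value.lower().split()]


def reduce_word_count(key, values):
    """reduce(key2, list of value2) -> list of value3: the total count for one word."""
    return [sum(values)]


def run_map_reduce(records, mapper, reducer):
    # Map phase: apply the mapper to every input record and
    # group the emitted values by key (stand-in for the shuffle).
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct key, in sorted key order.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}


records = [("doc1", "apache hadoop"), ("doc2", "apache lucene apache solr")]
counts = run_map_reduce(records, map_word_count, reduce_word_count)
print(counts)  # {'apache': [3], 'hadoop': [1], 'lucene': [1], 'solr': [1]}
```

Because each mapper call sees one record and each reducer call sees one key, both phases can run in parallel on separate machines, which is what makes the model fit large clusters.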
  • Hadoop Map/Reduce
  • Hadoop – Towards the Future
    • Better job scheduling, resource sharing
    • Hadoop Workflow systems
    • HBase – Large databases in the cloud
    • Performance improvements
    • Hundreds more!
  • How do I start?
    • Choose your project
    • Join the mailing list or forum
    • Check out the code
    • Find open issues and feature requests
    • Ask the developers what you can work on
  • Contributing
    • Ideas!
    • Features & Bug fixes
    • Unit tests
    • Documentation
    • Performance benchmarks
  • Do's and Don'ts
    • dnt rite sms lingo!
    • Be courteous
    • Don't be an island. Collaborate.
    • Learn from your mistakes
    • Persevere. It takes time.
  • Questions? Shalin Shekhar Mangar shalin [at] apache [dot] org http://twitter.com/shalinmangar http://shalinsays.blogspot.com