1st Birmingham Big Data Science Group meetup


Published in: Technology


  1. Welcome to the Birmingham Big Data Science Group (BIDS)
     Faizan Javed
     5/25/2011
     Intermark Group
     Sponsor: Intermark Group
  2. BIDS Stats
     Founded April 10, 2011
     9 members (and counting)
     Founder: Faizan Javed, Co-Founder: Qasim Ijaz
     Online presence:
       Meetup.com for coordinating meetups: http://www.meetup.com/bham-bids
       Also on (for related articles and announcements):
         LinkedIn: http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219
         Facebook: http://www.facebook.com/home.php?sk=group_202221519811444
  3. Agenda
     What is Big Data?
     Quick overview of related technologies:
       Large-scale distributed systems and platforms
       NoSQL data stores
       Intelligent algorithms, web mining, and information retrieval techniques
       Highly scalable systems
  4. What is Big Data?
     More people are connected to the internet
     Social media explosion (Web 2.0): Facebook, Twitter, etc.
     Huge volumes of data are being collected from sensors, mobile devices, machine-to-machine communications, social media, and retail-site web logs (for browsing patterns)
     "Big" in Big Data is relative: today's "big" is certainly tomorrow's "medium" and next week's "small."
     "Big Data" is when the size of the data itself becomes part of the problem. Going from gigabytes to petabytes!
     http://radar.oreilly.com/2010/06/what-is-data-science.html
  5.
  6. Big Data, Big Numbers
     McKinsey report, May 2011: http://www.mckinsey.com/mgi/publications/big_data/index.asp
  7. Why care about big data?
     Deep analysis of data can be a competitive advantage.
     More data makes it easier to find consistent patterns
     More data usually beats better algorithms
     Ex 1: Predict customer preferences and target ads on an e-commerce website.
     Ex 2: Improve search quality.
     Ex 3: Bank risk modeling (aggregating customer activity from different lines of business)
     http://blog.mikepearce.net/2010/08/18/10-hadoop-able-problems-a-summary/
     http://www.ft.com/intl/cms/s/0/64095dba-7cd5-11e0-994d-00144feabdc0.html#axzz1NHn8icSC
     Key point: "many different sources" and "unstructured data"
  8. Big Players on the Big Data Scene
     The Government
     http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef
  9. The need for new techniques
     Traditional "relational" techniques break down at scale.
     Solutions:
       NoSQL databases: Cassandra, HBase, Riak, etc.
       Large-scale, "commodity" scale-out distributed computing techniques: MapReduce/Hadoop, Percolator, etc.
       Analytics platforms: IBM BigInsights, EMC Greenplum
  10. The NoSQL revolution
      http://www.infoq.com/news/2011/04/newsql
  11. Prominent NoSQL database users
      Cassandra: Facebook, Twitter, Rackspace, Reddit, Digg.com
      Riak: Mozilla, Ask.com, Comcast
      Voldemort: LinkedIn
      MongoDB: Foursquare, Etsy, bit.ly, Intuit
      HBase: StumbleUpon, Twitter, Infolinks, Adobe, Meetup.com
  12. Hadoop-based SMAQ stack
      http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

          // Word-count reducer; in a full job this class is nested inside the job class.
          // Needs: java.io.IOException, org.apache.hadoop.io.IntWritable,
          // org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer.
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
          {
              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException
              {
                  // Sum the counts emitted by the mappers for this key.
                  int sum = 0;
                  for (IntWritable val : values)
                  {
                      sum += val.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }
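The reducer on the slide is only one phase of a MapReduce job. As a rough illustration of how it fits with the map and shuffle phases, here is a minimal in-memory sketch in plain Java. It does not use the Hadoop API; the class name `WordCountSketch` and its method names are made up for this example, and a real job would distribute these phases across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the MapReduce phases, not the Hadoop API.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: the same summing logic as the Reduce class on the slide.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Run the three phases in sequence over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data is big", "data beats algorithms")));
    }
}
```

Running `main` prints `{algorithms=1, beats=1, big=2, data=2, is=1}`. In Hadoop the framework performs the shuffle itself, so a job author writes only the map and reduce steps.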
  13. Hadoop-based SMAQ stack
      Hadoop comes with HDFS, the Hadoop Distributed File System.
      Can be used alongside various NoSQL systems (HBase is the most common)
  14. Hadoop-based SMAQ stack
      Pig (Yahoo)

          sentences = LOAD 'input/sentences.txt' USING TextLoader();
          words = FOREACH sentences GENERATE FLATTEN(TOKENIZE($0));
          grouped = GROUP words BY $0;
          counts = FOREACH grouped GENERATE group, COUNT(words);
          ordered = ORDER counts BY $0;
          STORE ordered INTO 'output/wordCount' USING PigStorage();

      Hive (Facebook)

          INSERT OVERWRITE TABLE xyz_com_page_views
          SELECT page_views.*
          FROM page_views
          WHERE page_views.date >= '2008-03-01'
            AND page_views.date <= '2008-03-31'
            AND page_views.referrer_url LIKE '%xyz.com';
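For readers more at home in Java than SQL, the Hive query on this slide amounts to a filter over page-view rows. The sketch below shows that logic on an in-memory list. The `PageView` class and its two fields are hypothetical stand-ins for the `page_views` table, and nothing here reflects how Hive executes the query (Hive compiles it into MapReduce jobs).

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: what the Hive query selects, computed on an in-memory list.
class PageViewFilterSketch {

    // Hypothetical row type; the real table has whatever columns its schema defines.
    static final class PageView {
        final LocalDate date;
        final String referrerUrl;

        PageView(LocalDate date, String referrerUrl) {
            this.date = date;
            this.referrerUrl = referrerUrl;
        }
    }

    // Keep rows dated in March 2008 whose referrer ends with "xyz.com"
    // (the SQL predicate referrer_url LIKE '%xyz.com').
    static List<PageView> marchViewsFromXyz(List<PageView> pageViews) {
        LocalDate start = LocalDate.of(2008, 3, 1);
        LocalDate end = LocalDate.of(2008, 3, 31);
        List<PageView> selected = new ArrayList<>();
        for (PageView view : pageViews) {
            boolean inRange = !view.date.isBefore(start) && !view.date.isAfter(end);
            if (inRange && view.referrerUrl.endsWith("xyz.com")) {
                selected.add(view);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<PageView> rows = List.of(
                new PageView(LocalDate.of(2008, 3, 15), "http://www.xyz.com"),
                new PageView(LocalDate.of(2008, 4, 1), "http://www.xyz.com"),
                new PageView(LocalDate.of(2008, 3, 31), "http://other.org"));
        System.out.println(marchViewsFromXyz(rows).size() + " matching row(s)");
    }
}
```

The point of Hive is that the one-statement query replaces hand-written code like this, while still running over data far too large for a single machine's memory.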
  15. Next-generation systems: going beyond MapReduce/Hadoop
      http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html
      Mostly Google and Yahoo innovations.
      Percolator: "real-time" MapReduce. Powers Google Instant.
      Dremel: a very fast, Hive-like system for interacting with large datasets. In-house at Google.
      Pregel: highly efficient graph computing for analyzing social graphs. In-house at Google. Open-source projects available.
      Megastore: a scalable, NoSQL-like system with ACID semantics within a partition but weaker consistency across partitions. In-house at Google.
      Next-gen Hadoop at Yahoo: enhanced scalability (going beyond 4000-node clusters), support for multiple programming paradigms, enhanced cluster utilization.
  16. Intelligent Web & machine learning
      Recommendation systems, data/web mining, natural language processing
      Recommendation systems:
        A type of collaborative filtering/information retrieval technique.
        Uses user profiles, ratings, browsing habits to recommend items not yet considered.
        First made famous in the commercial arena by Amazon.com
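To make the collaborative filtering idea concrete, here is a minimal sketch of one common variant, user-based filtering with cosine similarity. It is a toy under stated assumptions: the class `RecommenderSketch`, the sample users, and the ratings are all invented for illustration, and production systems such as Amazon's use different and far more scalable methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: user-based collaborative filtering on a tiny in-memory ratings map.
class RecommenderSketch {

    // Cosine similarity between two users' rating vectors (unrated items count as 0).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Rank items the user has not rated by the similarity-weighted average
    // of the ratings given by other users.
    static List<String> recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.getOrDefault(user, Map.of());
        Map<String, Double> weightedSum = new HashMap<>();
        Map<String, Double> weightTotal = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            double sim = cosine(mine, other.getValue());
            if (sim <= 0) {
                continue;
            }
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (!mine.containsKey(r.getKey())) {
                    weightedSum.merge(r.getKey(), sim * r.getValue(), Double::sum);
                    weightTotal.merge(r.getKey(), sim, Double::sum);
                }
            }
        }
        Map<String, Double> score = new HashMap<>();
        for (String item : weightedSum.keySet()) {
            score.put(item, weightedSum.get(item) / weightTotal.get(item));
        }
        List<String> items = new ArrayList<>(score.keySet());
        // Highest predicted rating first; ties broken alphabetically.
        items.sort((x, y) -> {
            int c = Double.compare(score.get(y), score.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        return items;
    }

    // Made-up sample data: bob's tastes match alice's, carol's do not.
    static Map<String, Map<String, Double>> sampleRatings() {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("A", 5.0, "B", 4.0));
        ratings.put("bob", Map.of("A", 5.0, "B", 4.0, "C", 5.0, "D", 1.0));
        ratings.put("carol", Map.of("A", 1.0, "B", 2.0, "D", 5.0));
        return ratings;
    }

    public static void main(String[] args) {
        System.out.println(recommend("alice", sampleRatings()));
    }
}
```

With the sample data, `main` prints `[C, D]`: item C ranks first because bob, whose ratings most resemble alice's, rated it highly, while D is pulled down by bob's low rating.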
  17. Amazon.com & Netflix recommendation systems
  18. Foursquare (3/2011) and Google Places (5/2011)
      http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
      http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html
  19. Hot area! Netflix and Overstock.com competitions
  20. Search Engines (Google, Bing, Wolfram, Lucene/Nutch, etc.)
  21. Search innovations @ LinkedIn
      http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/
      http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/
      Uses the open-source Lucene project for social graph search and real-time indexing and searching.
      Dynamic filters automatically generated based on your query results!
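The "dynamic filters" idea (faceted search) can be sketched in a few lines: look up the query term in an inverted index, then count the values of some attribute across the hits to build the filter list. The example below is plain Java and deliberately does not use the Lucene API. The `Profile` class, its `headline` and `industry` fields, and the sample data are all hypothetical, and this is not a description of LinkedIn's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index with facet counts, not the Lucene API.
class FacetedSearchSketch {

    // Hypothetical document type with one searchable field and one facet field.
    static final class Profile {
        final String headline;
        final String industry;

        Profile(String headline, String industry) {
            this.headline = headline;
            this.industry = industry;
        }
    }

    private final List<Profile> profiles = new ArrayList<>();
    // Inverted index: term -> ids of the profiles whose headline contains it.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Index a profile; it is searchable as soon as this method returns.
    void add(Profile profile) {
        int id = profiles.size();
        profiles.add(profile);
        for (String term : profile.headline.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Return the profiles matching a single query term.
    List<Profile> search(String term) {
        List<Profile> hits = new ArrayList<>();
        for (int id : index.getOrDefault(term.toLowerCase(), Set.of())) {
            hits.add(profiles.get(id));
        }
        return hits;
    }

    // Build the filter list from the hits themselves: industry -> number of matching profiles.
    static Map<String, Integer> facetCounts(List<Profile> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Profile p : hits) {
            counts.merge(p.industry, 1, Integer::sum);
        }
        return counts;
    }

    // Made-up sample data.
    static FacetedSearchSketch sample() {
        FacetedSearchSketch s = new FacetedSearchSketch();
        s.add(new Profile("Java engineer", "Software"));
        s.add(new Profile("Data engineer", "Software"));
        s.add(new Profile("Sales engineer", "Retail"));
        s.add(new Profile("Data scientist", "Finance"));
        return s;
    }

    public static void main(String[] args) {
        FacetedSearchSketch s = sample();
        System.out.println(facetCounts(s.search("engineer")));
        System.out.println(facetCounts(s.search("data")));
    }
}
```

With the sample data, `main` prints `{Retail=1, Software=2}` and then `{Finance=1, Software=1}`: the filter options change with the query because they are computed from each result set rather than fixed in advance.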
  22. Conclusion
      Big Data is a very challenging and promising area
      Can be used to gain a competitive advantage
      Usually brings about advances in computer science
      Vast range of topics: NoSQL systems, large-scale distributed computing systems, highly scalable web system designs
      Machine learning techniques: search engines, recommender systems