1st Birmingham Big Data Science Group meetup
    1st Birmingham Big Data Science Group meetup: Presentation Transcript

    • Welcome to the Birmingham Big Data Science Group (BIDS)
      Faizan Javed
      Intermark Group
      Sponsor: Intermark Group
    • BIDS Stats
      Founded April 10, 2011
      9 members (and counting)
      Founder: Faizan Javed, Co-Founder: Qasim Ijaz
      Online presence:
      Meetup.com for coordinating meetups:
      Also on (for related articles and announcements):
      LinkedIn: http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219
    • Agenda
      What is Big Data?
      Quick overview of related technologies:
      Large-scale distributed systems and platforms
      NoSQL data stores
      Intelligent algorithms/web-mining/information retrieval techniques
      Highly-scalable systems
    • What is Big Data?
      More people connected to the internet
      Social media explosion (Web 2.0): Facebook, Twitter, etc.
      Huge volumes of data being collected: sensors, mobile devices, machine-to-machine communications, social media and retail sites web logs for browsing patterns
      "Big" in Big Data is relative: today's "big" is certainly tomorrow's "medium" and next week's "small."
      "Big Data" is when the size of the data itself becomes part of the problem. Going from gigabytes to petabytes!
      http://radar.oreilly.com/2010/06/what-is-data-science.html
    • Big Data, Big Numbers (McKinsey report, May 2011): http://www.mckinsey.com/mgi/publications/big_data/index.asp
    • Why care about big data?
      Deep analysis of data can be a competitive advantage.
      More data  easier to find consistent patterns
      More data usually beats better algorithms
      Ex 1: Predict customer preferences and target ads on an ecommerce website.
      Ex 2: Improve search quality.
      Ex 3: Bank risk modeling (aggregate customer activity from different lines of businesses)
      Key point: “Many different sources” & “unstructured data”
    • Big Players on the Big Data Scene
      The Government http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef
    • The need for new techniques
      Traditional “relational” techniques break down at scale.
      NoSQL databases: Cassandra, HBase, Riak, etc.
      Large-scale “commodity” scale-out distributed computing techniques: MapReduce/Hadoop, Percolator, etc.
      Analytics platforms: IBM BigInsights, EMC Greenplum
    • The NoSQL revolution
      http://www.infoq.com/news/2011/04/newsql
    • Prominent NoSQL database users
      Cassandra: Facebook, Twitter, Rackspace, Reddit, Digg.com
      Riak: Mozilla, Ask.com, Comcast
      Voldemort: LinkedIn
      MongoDB: Foursquare, Etsy, bit.ly, Intuit
      HBase: StumbleUpon, Twitter, Infolinks, Adobe, Meetup.com
    • Hadoop-based SMAQ stack
      http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
      // Word-count reducer: sums the counts emitted for each word (key).
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
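The reducer above is only one of the three phases of a MapReduce job. As a rough illustration of how map, shuffle, and reduce fit together for word count, here is a minimal sketch in plain Java. It does not use Hadoop at all: the class name `WordCountSketch` and its methods are hypothetical, and the in-memory "shuffle" stands in for what the framework does across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of MapReduce word count (assumption: no Hadoop classes involved).
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: same logic as the reducer on the slide, summing the values for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Runs the three phases in sequence on a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {algorithms=1, beats=1, big=2, data=2, is=1}
        System.out.println(wordCount(List.of("big data is big", "data beats algorithms")));
    }
}
```

In a real Hadoop job the map and reduce calls run on different machines and the framework performs the grouping step; the sketch only shows the data flow.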
    • Hadoop-based SMAQ stack
      Hadoop comes with HDFS – Hadoop Distributed File System.
      Can be used alongside various NoSQL systems (HBase most common)
    • Hadoop-based SMAQ stack
      Pig (Yahoo):
      input = LOAD 'input/sentences.txt' USING TextLoader();
      words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
      grouped = GROUP words BY $0;
      counts = FOREACH grouped GENERATE group, COUNT(words);
      ordered = ORDER counts BY $0;
      STORE ordered INTO 'output/wordCount' USING PigStorage();
      Hive (Facebook):
      INSERT OVERWRITE TABLE xyz_com_page_views
      SELECT page_views.*
      FROM page_views
      WHERE page_views.date >= '2008-03-01'
        AND page_views.date <= '2008-03-31'
        AND page_views.referrer_url LIKE '%xyz.com';
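To make the semantics of the Hive query concrete, the sketch below applies the same filter to an in-memory list in plain Java. It is only an illustration: the `PageView` class, its fields, and the method name are hypothetical stand-ins for the `page_views` table, and Hive itself would run such a query as distributed jobs over files in HDFS rather than over a Java list.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

final class PageViewFilterSketch {

    // Hypothetical row type: one page URL plus the two columns the query filters on.
    static final class PageView {
        final String url;
        final LocalDate date;
        final String referrerUrl;

        PageView(String url, LocalDate date, String referrerUrl) {
            this.url = url;
            this.date = date;
            this.referrerUrl = referrerUrl;
        }
    }

    // Equivalent of:
    //   WHERE date >= '2008-03-01' AND date <= '2008-03-31'
    //     AND referrer_url LIKE '%xyz.com'
    static List<PageView> marchViewsFromXyz(List<PageView> pageViews) {
        LocalDate from = LocalDate.of(2008, 3, 1);
        LocalDate to = LocalDate.of(2008, 3, 31);
        return pageViews.stream()
                // Inclusive date range, matching >= and <= in the query.
                .filter(v -> !v.date.isBefore(from) && !v.date.isAfter(to))
                // LIKE '%xyz.com' matches any referrer that ends with "xyz.com".
                .filter(v -> v.referrerUrl.endsWith("xyz.com"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<PageView> views = List.of(
                new PageView("/a", LocalDate.of(2008, 3, 15), "http://www.xyz.com"),
                new PageView("/b", LocalDate.of(2008, 4, 1), "http://www.xyz.com"),
                new PageView("/c", LocalDate.of(2008, 3, 31), "http://xyz.com/landing"));
        for (PageView v : marchViewsFromXyz(views)) {
            System.out.println(v.url);
        }
    }
}
```

Note that the `LIKE '%xyz.com'` pattern only matches referrers ending in `xyz.com`, so a referrer such as `http://xyz.com/landing` is excluded.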
    • Next-generation systems: going beyond MapReduce/Hadoop
      http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html
      Mostly Google and Yahoo innovations.
      Percolator – “real-time” MapReduce. Powers Google Instant.
      Dremel – a very fast, Hive-like system for interactive queries over large datasets. In-house at Google.
      Pregel – highly efficient graph computing for analyzing social graphs. In-house at Google; open-source implementations are available.
      Megastore – scalable NoSQL-like system with ACID semantics within a partition but weaker consistency across partitions. In-house at Google.
      Next-gen Hadoop at Yahoo: enhanced scalability (going beyond 4,000-node clusters), support for multiple programming paradigms, enhanced cluster utilization.
    • Intelligent Web & machine learning
      Recommendation systems, data/web mining, natural language processing
      Recommendation systems:
      A type of collaborative filtering/information retrieval technique.
      Uses user profiles, ratings, browsing habits to recommend items not yet considered.
      First made famous in the commercial arena by Amazon.com
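As a rough illustration of the idea, here is a minimal sketch of user-based collaborative filtering in plain Java. It is a toy under stated assumptions, not the method used by Amazon or any other company named in these slides: the class `RecommenderSketch`, the cosine similarity measure, and the sample ratings are all chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RecommenderSketch {

    // Cosine similarity between two users' rating vectors (unrated items count as 0).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores every item the target user has not rated by the similarity-weighted
    // ratings of the other users, and returns those items by descending score.
    static List<String> recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.getOrDefault(user, Map.of());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            double sim = cosine(mine, other.getValue());
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (!mine.containsKey(r.getKey())) {
                    scores.merge(r.getKey(), sim * r.getValue(), Double::sum);
                }
            }
        }
        List<String> items = new ArrayList<>(scores.keySet());
        items.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return items;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = Map.of(
                "alice", Map.of("book", 5.0, "camera", 4.0),
                "bob", Map.of("book", 5.0, "camera", 4.0, "laptop", 5.0, "tripod", 2.0),
                "carol", Map.of("book", 1.0, "phone", 5.0));
        // prints [laptop, tripod, phone]
        System.out.println(recommend("alice", ratings));
    }
}
```

Because bob's ratings closely match alice's, his unseen items outrank carol's. Production systems face the scale issues discussed earlier: the similarity computation over millions of users or items is where distributed platforms come in.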
    • Amazon.com & Netflix recommendation systems
    • Foursquare (3/2011) and Google Places (5/2011)
      http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
      http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html
    • Hot area! Netflix and Overstock.com competitions
    • Search Engines (Google, Bing, Wolfram, Lucene/Nutch, etc)
    • Search innovations @ LinkedIn
      http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/
      http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/
      Uses the open-source Lucene project for social graph search and real-time indexing and searching.
      Dynamic filters automatically generated based on your query results!
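The "dynamic filters" idea (faceted search) can be sketched as counting field values over the current result set and then narrowing the results when a value is selected. The code below is a minimal plain-Java illustration: it does not use Lucene, and the class `FacetSketch`, the field names, and the sample records are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class FacetSketch {

    // For each field, count how many results carry each value.
    // These counts are what a UI would display as clickable filters.
    static Map<String, Map<String, Integer>> facets(List<Map<String, String>> results) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map<String, String> doc : results) {
            for (Map.Entry<String, String> field : doc.entrySet()) {
                counts.computeIfAbsent(field.getKey(), k -> new TreeMap<>())
                        .merge(field.getValue(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Narrow the result set to documents matching one selected facet value.
    static List<Map<String, String>> refine(
            List<Map<String, String>> results, String field, String value) {
        return results.stream()
                .filter(doc -> value.equals(doc.get(field)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = List.of(
                Map.of("company", "Acme", "location", "Birmingham"),
                Map.of("company", "Initech", "location", "Birmingham"),
                Map.of("company", "Acme", "location", "Atlanta"));
        System.out.println(facets(results));
        System.out.println(refine(results, "location", "Atlanta"));
    }
}
```

Recomputing the facet counts after each `refine` call is what makes the filters "dynamic": they always reflect the results currently shown.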
    • Conclusion
      Big Data is a very challenging and promising area
      Can be used to get a competitive advantage
      Usually brings about advances in computer science
      Vast area of topics: NoSQL systems, large-scale distributed computing systems, highly scalable web system designs
      Machine learning techniques: search engines, recommender systems