1st Birmingham Big Data Science Group meetup


Transcript

  • 1. Welcome to the Birmingham Big Data Science Group (BIDS)
    Faizan Javed
    5/25/2011
    Intermark Group
    Sponsor: Intermark Group
  • 2. BIDS Stats
    Founded April 10, 2011
    9 members (and counting...)
    Founder: Faizan Javed, Co-Founder: Qasim Ijaz
    Online presence:
    Meetup.com for coordinating meetups:
    http://www.meetup.com/bham-bids
    Also on (for related articles and announcements):
    LinkedIn: http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219
    Facebook: http://www.facebook.com/home.php?sk=group_202221519811444
  • 3. Agenda
    What is Big Data?
    Quick overview of related technologies:
    Large-scale distributed systems and platforms
    NoSQL data stores
    Intelligent algorithms/web-mining/information retrieval techniques
    Highly-scalable systems
  • 4. What is Big Data?
    More people connected to the internet
    Social media explosion (Web 2.0): Facebook, Twitter, etc.
    Huge volumes of data being collected: sensors, mobile devices, machine-to-machine communications, social media, and retail-site web logs that capture browsing patterns
    “Big” in Big Data is relative: today's “big” is certainly tomorrow's “medium” and next week's “small.”
    “Big Data” is when the size of the data itself becomes part of the problem. Going from gigabytes to petabytes!
    Source: http://radar.oreilly.com/2010/06/what-is-data-science.html
  • 5.
  • 6. Big Data, Big Numbers
    McKinsey report, May 2011: http://www.mckinsey.com/mgi/publications/big_data/index.asp
  • 7. Why care about big data?
    Deep analysis of data can be a competitive advantage.
    More data  easier to find consistent patterns
    More data usually beats better algorithms
    Ex 1: Predict customer preferences and target ads on an ecommerce website.
    Ex 2: Improve search quality.
    Ex 3: Bank risk modeling (aggregate customer activity from different lines of business)
    http://blog.mikepearce.net/2010/08/18/10-hadoop-able-problems-a-summary/
    http://www.ft.com/intl/cms/s/0/64095dba-7cd5-11e0-994d-00144feabdc0.html#axzz1NHn8icSC
    Key point: “Many different sources” & “unstructured data”
  • 8. Big Players on the Big Data Scene
    The Government http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef
  • 9. The need for new techniques
    Traditional “relational” techniques break down at scale.
    Solutions:
    NoSQL databases: Cassandra, HBase, Riak, etc.
    Scale-out distributed computing on “commodity” hardware: MapReduce/Hadoop, Percolator, etc.
    Analytics platforms: IBM BigInsights, EMC Greenplum
  • 10. The NoSQL revolution
    http://www.infoq.com/news/2011/04/newsql
  • 11. Prominent NoSQL database users
    Cassandra: Facebook, Twitter, Rackspace, Reddit, Digg.com
    Riak: Mozilla, Ask.com, Comcast
    Voldemort: LinkedIn
    MongoDB: Foursquare, Etsy, bit.ly, Intuit
    HBase: StumbleUpon, Twitter, Infolinks, Adobe, Meetup.com
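  • 12. Hadoop-based SMAQ stack
    http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Word-count reducer, declared as a nested class of the job's driver class.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
        {
            // Sum the counts that the mappers emitted for this word.
            int sum = 0;
            for (IntWritable val : values)
            {
                sum += val.get();
            }
            // Emit one (word, total) pair per key.
            context.write(key, new IntWritable(sum));
        }
    }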
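    The reducer above is only one stage of a MapReduce job. As a rough illustration of how it fits with the map and shuffle stages, here is a minimal single-machine sketch in plain Java. It does not use the Hadoop API; the class name WordCountSketch and its methods are hypothetical, and a real job would distribute each phase across a cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept in sorted order.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: same logic as the slide's reducer, summing the values for one key.
    public static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Run the three phases in sequence on an in-memory list of lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, is=1, science=1}
        System.out.println(wordCount(Arrays.asList("big data is big", "data science")));
    }
}
```

    In Hadoop the shuffle step is handled by the framework, which is why the slide only needs to show user-written map and reduce code.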
  • 13. Hadoop-based SMAQ stack
    Hadoop comes with HDFS – Hadoop Distributed File System.
    Can be used alongside various NoSQL systems (HBase most common)
  • 14. Hadoop-based SMAQ stack
    Pig (Yahoo)
    input = LOAD 'input/sentences.txt' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    ordered = ORDER counts BY $0;
    STORE ordered INTO 'output/wordCount' USING PigStorage();
    Hive (Facebook)
    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31'
      AND page_views.referrer_url LIKE '%xyz.com';
  • 15. Next-generation systems: going beyond MapReduce/Hadoop
    http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html
    Mostly Google and Yahoo innovations.
    Percolator – “real-time” MapReduce. Powers Google Instant.
    Dremel – a very fast, Hive-like system for interactive queries over large datasets. In-house at Google.
    Pregel – highly efficient graph computing for analyzing social graphs. In-house at Google; open-source implementations are available.
    Megastore – scalable NoSQL-like system with ACID semantics within a partition but weaker consistency across partitions. In-house at Google.
    Next-gen Hadoop at Yahoo: enhanced scalability (going beyond 4,000 nodes per cluster), support for multiple programming paradigms, enhanced cluster utilization.
  • 16. Intelligent Web & machine learning
    Recommendation systems, data/web mining, natural language processing
    Recommendation systems:
    A type of collaborative filtering/information retrieval technique.
    Uses user profiles, ratings, browsing habits to recommend items not yet considered.
    First made famous in the commercial arena by Amazon.com
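    To make the idea of recommending "items not yet considered" concrete, here is a minimal sketch of user-based collaborative filtering in plain Java. It is an illustration under stated assumptions, not how Amazon.com or any other site mentioned here implements recommendations: the class RecommenderSketch, the sample ratings, and the choice of cosine similarity are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class RecommenderSketch {

    // Made-up ratings (user -> item -> score) for illustration only.
    public static Map<String, Map<String, Double>> sampleRatings() {
        Map<String, Double> alice = new HashMap<>();
        alice.put("book-a", 5.0);
        alice.put("book-b", 4.0);
        Map<String, Double> bob = new HashMap<>();
        bob.put("book-a", 5.0);
        bob.put("book-b", 4.0);
        bob.put("book-c", 5.0);
        Map<String, Double> carol = new HashMap<>();
        carol.put("book-a", 1.0);
        carol.put("book-d", 5.0);
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", alice);
        ratings.put("bob", bob);
        ratings.put("carol", carol);
        return ratings;
    }

    // Cosine similarity between two users' rating vectors (missing ratings count as 0).
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return the unrated item with the highest similarity-weighted score, or null if none.
    public static String recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.get(user);
        if (mine == null) {
            return null;
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            double sim = cosine(mine, other.getValue());
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (!mine.containsKey(r.getKey())) {
                    scores.merge(r.getKey(), sim * r.getValue(), Double::sum);
                }
            }
        }
        String best = null;
        for (Map.Entry<String, Double> s : scores.entrySet()) {
            if (best == null || s.getValue() > scores.get(best)) {
                best = s.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // alice's ratings closely match bob's, so bob's extra item wins: prints book-c
        System.out.println(recommend("alice", sampleRatings()));
    }
}
```

    Production systems typically also draw on browsing habits and profile data, as the slide notes; this sketch uses explicit ratings only.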
  • 17. Amazon.com & Netflix recommendation systems
  • 18. Foursquare (3/2011) and Google Places (5/2011)
    http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
    http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html
  • 19. Hot area! Netflix and Overstock.com competitions
  • 20. Search Engines (Google, Bing, Wolfram, Lucene/Nutch, etc.)
  • 21. Search innovations @ LinkedIn
    http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/
    http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/
    Uses the open-source Lucene project for social graph search and real-time indexing and searching.
    Dynamic filters automatically generated based on your query results!
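    The "dynamic filters" idea is usually called faceted search: after a query runs, the engine counts how many results fall under each value of a field and offers those values as filters. The sketch below shows that counting step in plain Java. It does not use Lucene and is not LinkedIn's implementation; the class FacetSketch, the Profile record, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class FacetSketch {

    // Hypothetical profile record with two fields that can serve as facets.
    public static final class Profile {
        public final String title;
        public final String company;
        public final String location;

        public Profile(String title, String company, String location) {
            this.title = title;
            this.company = company;
            this.location = location;
        }
    }

    // Made-up sample data for illustration only.
    public static List<Profile> sampleProfiles() {
        return Arrays.asList(
            new Profile("Data Engineer", "Acme", "Birmingham"),
            new Profile("Software Engineer", "Acme", "Atlanta"),
            new Profile("Data Scientist", "Initech", "Birmingham"),
            new Profile("Search Engineer", "Initech", "Birmingham"));
    }

    // Naive stand-in for the search engine: case-insensitive substring match on the title.
    public static List<Profile> search(List<Profile> all, String query) {
        List<Profile> hits = new ArrayList<>();
        for (Profile p : all) {
            if (p.title.toLowerCase().contains(query.toLowerCase())) {
                hits.add(p);
            }
        }
        return hits;
    }

    // Count how many results carry each value of one field; the counts become the filters.
    public static Map<String, Integer> facetCounts(List<Profile> results,
                                                   Function<Profile, String> field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Profile p : results) {
            counts.merge(field.apply(p), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Profile> hits = search(sampleProfiles(), "engineer");
        System.out.println(facetCounts(hits, p -> p.company));   // prints {Acme=2, Initech=1}
        System.out.println(facetCounts(hits, p -> p.location));  // prints {Atlanta=1, Birmingham=2}
    }
}
```

    Because the counts are computed from the current result set, the filters change with every query, which is what makes them "dynamic".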
  • 22. Conclusion
    Big Data is a very challenging and promising area
    Can be used to get a competitive advantage
    Usually brings about advances in computer science
    Vast area of topics: NoSQL systems, large-scale distributed computing systems, highly scalable web system designs
    Machine learning techniques: search engines, recommender systems