The Rise of Big Data Science

This is an introductory lecture on the buzziest technology domain today.
The domain encompasses many new concepts, keywords, theories and paradigm shifts, from computer science to business.

Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

  • This is an introductory lecture on the buzziest technology domain today. The domain encapsulates many new concepts, keywords and theories, which keep the full academic rainbow, from computer science to business departments, busy digesting these fast-paced developments. Academic institutions should, and do, offer new tracks to support them.
  • This trivial equation tells the whole story. The subject of this lecture consists of two parts, Big Data and Data Science, and the lecture is divided accordingly. Of course, we'll see how the two are connected and related to each other.
  • The Big Data tour is divided into three parts (as is everything in big data, as you'll see shortly).
  • We'll start with the why, so that the what is better understood. Big Data is the business and technological aspect of a wider social phenomenon we are currently living in. Like past social revolutions, this one started with a technological revolution; e.g. the French Revolution was arguably a side effect of the Industrial Revolution. The same is true here: the Internet created a social revolution in which everyone is connected to everyone.
  • Big Data as a phenomenon actually started with the rise of Web 2.0. In the older Web 1.0, only site owners created online data; with Web 2.0, the users create the content.
  • The Big Data tour is divided into three parts (as is everything in big data, as you'll see shortly).
  • Big Data -> big numbers. Taken from http://visual.ly/what-big-data
  • Big Users is an equally big trend driving developers to use NoSQL databases. Most new applications are made available over the internet so that people can easily access them, which has caused the number of simultaneous users of many applications to explode. More than 2 billion people are connected to the internet, and the number is growing rapidly. The number of hours the average user spends online is growing too, and with the proliferation of smartphones people use their applications more and more frequently; both further increase the number of simultaneous users. All these simultaneous users lead to a rapidly growing number of database operations and to the need for a far easier way to scale your database to meet these demands. Taken from a Couchbase deck @ IGTCloud summit 2013. http://www.go-gulf.com/blog/online-time http://business.time.com/2012/02/14/one-billion-smartphones-by-2016-here-comes-the-mobile-arms-race/
  • To summarize, the technology implications of the Big Data, Big Users, and Cloud Computing mega trends are causing people to seriously rethink which database they use for their applications, and they are increasingly concluding that NoSQL databases are a better fit than relational databases.
  • Finally, the move to cloud computing and SaaS business models is also driving developers to consider NoSQL databases. Fifteen years ago, most applications were developed with a client/server architecture and a packaged-software business model that supported the needs of users on a company-by-company basis. Today, applications are increasingly developed using a 3-tier internet architecture, are cloud-based, and use a Software-as-a-Service business model that needs to support the collective needs of thousands of customers. This approach increasingly requires a horizontally scalable architecture that easily scales with the number of users and the amount of data your application has.
  • The Big Data tour is divided into three parts (as is everything in big data, as you'll see shortly).
  • Outbrain serves 8 billion impressions a month ≈ 3,000 impressions/sec; DG (MediaMind) serves 50 billion a day, i.e. over 500K/sec. http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data http://www.computerworlduk.com/in-depth/applications/1779/oracles-database-machine-how-much-will-it-really-cost/
  • http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data
  • MapReduce provides: user-defined functions; automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring.
  • Taken from http://db-engines.com/en/ranking
  • This trivial equation tells the whole story. The subject of this lecture consists of two parts, Big Data and Data Science, and the lecture is divided accordingly. Of course, we'll see how the two are connected and related to each other.
  • OK, we have the big data. Now, what do we do with it? Big data is important if you want to succeed in analytic processing. But why is that important? Because success in a highly competitive, fast-moving marketplace is determined by who can capitalize on business opportunities before everyone else seizes them. In this section we'll meet the data scientists / data miners who coax treasures out of the huge volume of data.
  • Although Onavo started as a service that optimizes device and app performance, along the way it collected logs from those apps and devices and became one of the leading mobile analytics aggregators in the world.
  • Notation first. The field has many names that mean more or less the same thing: the art of inferring insights from data.
  • In this section we'll meet the data scientists / data miners who coax treasures out of the huge volume of data. The domains applying data science / data mining vary:
  • Learning consists of three steps. First, we build our probabilistic model of the real world. Then, offline, we train the model with labeled (supervised) examples, i.e. this is a car, this is not a car. Last, online, we feed the model a totally new example and expect it to predict the correct label.
  • Drew Conway, http://www.dataists.com/2010/09/the-data-science-venn-diagram
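
The three learning steps described in the notes above (choose a model, train it offline on labeled examples, then predict online for a new example) can be sketched roughly as follows. This is only an illustration under stated assumptions: it swaps in a simple nearest-centroid classifier rather than a probabilistic model, and the two features (wheel count, length in metres) and all data points are invented for the example.

```python
from statistics import mean

def train(examples):
    """Offline step: compute one centroid per label from labeled examples."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(model, features):
    """Online step: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical labeled examples: (wheel count, length in metres) -> label.
labeled = [
    ((4, 4.2), "car"), ((4, 4.8), "car"), ((4, 3.9), "car"),
    ((2, 1.8), "not a car"), ((2, 2.1), "not a car"), ((3, 1.2), "not a car"),
]

model = train(labeled)            # offline training
print(predict(model, (4, 4.5)))   # online prediction for an unseen example
```

A real system would use far more examples and a properly evaluated model; the point here is only the split between the offline `train` step and the online `predict` step.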

Presentation Transcript

  • The Rise of Big Data Science GILAD BARKAN
  • Big Data Science: Big Data + Data Science = Big Data Science
  • Big Data: Why? What? How?
  • Why Big Data? We live in an era flooded with information. In a world where data is power, big data is big power.
  • Why Big Data? Web 2.0
  • Why should we care about Big Data? The big business opportunities; a competitive, fast-moving marketplace; capitalizing on business opportunities before everyone else; existing channels to every person on the planet; maximizing revenues from customers; segment-of-1, i.e. more personal customer experiences.
  • Big Data: Why? What? How?
  • What is Big Data? The 3 V's: Volume, Variety, Velocity
  • Big Data - Volume
  • Big Data - Volume: Big Users. More users, all the time: 2 billion global online population; 35 billion hours spent online; 1+ billion smartphone users.
  • More Users + More Data = Big Data
  • What is Big Data? The 3 V's: Volume, Variety, Velocity
  • Big Data - Variety: trillions of gigabytes (zettabytes) from heterogeneous sources of data. Structured data: tables. Un/semi-structured data: text, log files, click streams, blogs, tweets, audio, images, video, etc. Typical sizes: 700 MB / movie; 5000 KB / song; 1000 KB / image; 5 KB / record (traditional, structured, SQL); 50 KB / record (unstructured, NoSQL).
  • What is Big Data? The 3 V's: Volume, Variety, Velocity
  • Big Data - Velocity: How the hell does Google return an answer in 0.28 seconds by looking at 4 billion pages?
  • Big Data - Velocity: Online Advertisement - Real Time Bidding (RTB)
  • Big Data - Velocity: Recommendations
  • Big Data: Why? What? How?
  • How is Big Data Handled? The challenge is huge: store, analyze and serve a huge volume of varied data at high velocity. We can't achieve this using a single machine, no matter how strong it is. Why? It is expensive (stay tuned); it has to load-balance requests (Outbrain serves 3,000 per second; DG (MediaMind) serves 500K per second!!!); and it is not fault tolerant.
  • The Big Data Paradigm Shifts - Volume: distributing the data. Scale up (vertical): a single SQL server. Scale out (horizontal): a Hadoop cluster of HDFS (GFS) nodes.
  • Big Data - Reducing Costs: Hadoop is a 5 times cheaper infrastructure!!! TCO (purchase + maintenance) for 3 years per 300 TB: DBMS server = 5 M$; 75-node cluster = 1 M$.
  • Big Data Paradigm Shift - Computing: the MapReduce computing paradigm exploits the distributed architecture for large-scale computations in parallel.
  • MapReduce: “Hello MapReduce”, counting words. [Diagram: on a Hadoop cluster coordinated by a master, each mapper emits a table of words (W) and counts (C) for its share of the input; the reducer adds these tables into the totals the = 21, Cow = 2, quick = 5.]
  • Big Data Paradigm Shift - NoSQL (Variety): schema-less databases support the variety of data. Complex SQL queries (joins, etc.) in a distributed data framework are extremely inefficient. NoSQL key-value store: the key can be anything (user_id, url, image_id, video_id), not a single primary key as in SQL tables; the value can be anything (text, images, video).
  • Big Data Paradigm Shift - Velocity: RAM-based DBs instead of traditional disk-based DBs; store critical data in memory (much more expensive). If the data doesn't come to the algorithm, the algorithm will come to the data. [Diagram: traditional vs. today, showing how the algorithm reads and writes the data.]
  • Big Data - Summary
  • Big Data - Summary: BIG business opportunities; the 3 V's (Volume, Variety, Velocity); technological paradigm shifts
  • Big Data Technological Paradigm Shifts: Volume (scale up vs. scale out; MapReduce with mappers, a master and a reducer); Variety (NoSQL key-value stores); Velocity (the algorithm comes to the data)
  • Big Data - Summary: BIG business opportunities; the 3 V's (Volume, Variety, Velocity); computing and DB paradigm shifts; a flood of new (open source) technologies
  • Flood of New Big Data Technologies: Open Source
  • Big Data - Summary: BIG business opportunities; the 3 V's (Volume, Variety, Velocity); computing and DB paradigm shifts; a flood of new (open source) technologies; it's definitely not just a buzz
  • Big Buzz?
  • Big Data - Summary: BIG business opportunities; the 3 V's (Volume, Variety, Velocity); computing and DB paradigm shifts; a flood of new (open source) technologies. It's definitely not just a buzz: it's a real response to the world's hectic pace of evolution, reducing costs by an order of magnitude. Still, it doesn't mean every business today will or should transform its technology stack to support big data.
  • Big Data Science: Big Data + Data Science = Big Data Science
  • Data Science: Why? What? How?
  • Why Data Science? Data scientists
  • Data has real value: Facebook acquires Onavo for ~150M$
  • Data Science: Why? What? How?
  • Welcome to the Intelligent World: Data Analysis, Data Mining, Data Analytics, Data Science, Automatic Decisioning, Machine Learning, Predictive Analytics
  • Data Miners are the New Gold Miners
  • Search
  • Online Advertisement - Real Time Bidding (RTB)
  • Recommendations
  • Text Analysis
  • CRM – Customers Churn Prediction
  • Time Series Analysis
  • Machine Learning: Classification, Clustering, Regression, Recommendation
  • Classification: Amdocs Insight™ - why is the customer calling the call center? Pay bill; third-party charges; bill too high; overage; abnormal fee
  • Clustering: Market Segmentation, Social Network Analysis
  • Regression: housing price prediction. [Plot: price ($) in 1000's against size in m².]
  • The Data Scientist
  • Data Scientist Skillset: hands-on with tools, languages, technologies; MSc / PhD in Math, CS, Stats, Physics; hands-on with the specific problem domain
  • Data Science ≠ BI: apply advanced statistical machine learning algorithms to dig deeper and find patterns that traditional BI tools may not reveal, across a much wider spectrum of domains / applications. Predictive Analytics ≠ Exploratory Analytics
  • Predictive Analytics (Data Science, Big Data Science) vs. Exploratory Analytics (Business Intelligence, Traditional BI)
  • Academia Response to Data Science
  • Data Science: Why? What? How?
  • The Art of Data Science: we need at least a one-semester course for it. Still…
  • Data Science Life Cycle [cycle diagram; phases: Offline and Run Time; stages shown: Business Goal, Understand Data, Data Analysis, Prepare Data, Model, Evaluate, Deploy, Monitor]
  • Closing the Loop: technically speaking, what do you think? Is Big Data good or bad for Data Science? (Big Data + Data Science = Big Data Science)
  • The Bad - Finding a Needle in a Haystack: it's the same treasure that hides; the problem is that the pile is now huge. Big Data -> Big Noise
  • The Good - The Statistical View: statistics is the fuel of predictive analytics! The more data you have (Big Data), the better your predictive models will perform.
  • Law of Large Numbers
  • Combining the Good & Bad: data is a function of quality and quantity. [Chart axes: quality (low to high) and quantity (small to big).]
  • Big Data Science - Summary: Big Data -> Big Numbers -> Big Opportunities. Big Data is the buzziest technology nowadays. Data scientists are the ones who coax treasures out of the big data for their companies; they are multi-disciplinary and are the new industry rock stars.
  • Thank You for your attention
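
As a companion to the “Hello MapReduce” word-count slide, the map, shuffle and reduce phases can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop's API: on a real cluster the mappers and reducers run in parallel on different nodes, and the framework performs the grouping step. The function names and the three input splits are invented for the example.

```python
from collections import defaultdict
from itertools import chain

def map_words(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(word, counts):
    """Reducer: sum the counts emitted for one word."""
    return word, sum(counts)

# Hypothetical input, already divided into splits (one per mapper).
splits = ["the quick cow", "the cow and the quick fox", "the fox"]

mapped = chain.from_iterable(map_words(s) for s in splits)
totals = dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
print(totals)
```

With these splits, "the" is counted four times and "cow" twice. Because each mapper only sees its own split and each reducer only sees one key's values, both phases can be distributed across machines without changing the logic.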
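
The “Law of Large Numbers” slides behind “The Good - The Statistical View” can be illustrated with a small simulation: the average of many independent samples tends toward the true expected value as the sample grows. The sketch below assumes a fair coin (true mean 0.5) and a fixed random seed; both are choices made for this example, not something taken from the deck.

```python
import random

def sample_mean(n, rng):
    """Average of n fair coin flips (1 = heads, 0 = tails)."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    estimate = sample_mean(n, rng)
    print(f"n={n:>7}  mean={estimate:.4f}  error={abs(estimate - 0.5):.4f}")
```

The error typically shrinks as n grows, which is the statistical intuition behind "more data, better predictive models". It says nothing about data quality, which is the other axis on the “Combining the Good & Bad” slide.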