The Rise of Big Data Science
GILAD BARKAN
This is an introductory lecture on the buzziest technology domain nowadays.
The domain encapsulates a lot of new concepts, keywords, theories and paradigm shifts, from computer science to business.

Published in: Technology, Education
  • It’s an introductory lecture on the buzziest technology domain nowadays. The domain encapsulates a lot of new concepts, keywords and theories, which keep the full academic rainbow, from computer science to business departments, very busy digesting these upcoming, fast-paced concepts. Academic institutions should, and do, offer new tracks to support these developments.
  • This trivial equation tells the whole story. The subject of this lecture comprises two parts: Big Data & Data Science, and the lecture will appropriately be divided into these two parts. Of course we’ll see how they are connected and related to each other.
  • The Big Data tour will be divided into 3 parts (as everything is in…big data, and you’ll see shortly)
  • We’ll start with the why, and then the what will be better understood. Big Data is a business / technological aspect of a wider social phenomenon we’re currently living in. Like all past social revolutions, this one started with a technological revolution, e.g. the French Revolution was a side effect of the industrial revolution. This is the same case: the Internet created a social revolution in which everyone is connected to everyone.
  • Actually, Big Data as a phenomenon started with the rise of Web 2.0. Unlike the older Web 1.0, where only site owners created the online data, in Web 2.0 the users create the content.
  • Big Data -> big numbers. Taken from http://visual.ly/what-big-data
  • Big Users is an equally big trend driving developers to use NoSQL databases. Most new applications are made available over the internet so people can easily access them. This has caused the number of simultaneous users for many applications to explode. The number of people connected to the internet is more than 2B and growing rapidly. The number of hours that the average user spends on the internet is growing too, further increasing the number of simultaneous users. And, with the proliferation of smartphones, people use their applications more and more frequently, further increasing the number of simultaneous users. All these simultaneous users lead to a rapidly growing number of database operations and the need for a far easier way to scale your database to meet these demands. Taken from the Couchbase deck @ IGTCloud summit 2013. See also http://www.go-gulf.com/blog/online-time and http://business.time.com/2012/02/14/one-billion-smartphones-by-2016-here-comes-the-mobile-arms-race/
  • To summarize, the technology implications of the Big Data, Big User, and Cloud Computing mega trends are causing people to seriously rethink what database they use for their applications and are increasingly coming to the conclusion that NoSQL databases are a better fit than relational databases.
  • Finally, the move to cloud computing and SaaS business models is also driving developers to consider NoSQL databases. 15 years ago most applications were developed with a client/server architecture and a packaged software business model that supported the needs of users on a company-by-company basis. Today, applications are increasingly developed using a 3-tier internet architecture, are cloud-based, and use a Software-as-a-Service business model that needs to support the collective needs of thousands of customers. This approach increasingly requires a horizontally scalable architecture that easily scales with the number of users and the amount of data your application has.
  • Outbrain serves 8 billion impressions a month ≈ 3,000 impressions / sec; DG (MediaMind) serves 50 billion a day ≈ 500K / sec. Sources: http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data and http://www.computerworlduk.com/in-depth/applications/1779/oracles-database-machine-how-much-will-it-really-cost/
  • http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data
  • MapReduce provides: user-defined functions; automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring.
  • Taken from http://db-engines.com/en/ranking
  • Ok, we have the big data. Now, what are we doing with it? Big data is important if you want to be successful in analytic processing. But why is that important? The answer is that success in a highly competitive, fast-moving marketplace is determined by who can capitalize on business opportunities before everyone else seizes the same opportunity. In this section we’ll meet the data scientists / data miners that coax treasures out of the huge volume of data.
  • Although Onavo started as a service that optimizes device & app performance, along the way they collected logs from these apps & devices and became one of the leading mobile analytics aggregators in the world.
  • Terminology first. It has many names that mean more or less the same thing: the art of inferring insights from data.
  • In this section we’ll meet the data scientists / data miners that coax treasures out of the huge volume of data. The domains applying data science / data mining vary:
  • Learning comprises three steps. First, we build our probabilistic model of the real world. Then, we train the model with labeled (supervised) examples, i.e. this is a car, this is not a car. This takes place offline. Last, online, we feed the model a totally new example and expect it to give us the correct prediction.
  • Drew Conway, http://www.dataists.com/2010/09/the-data-science-venn-diagram
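
The MapReduce note above lists user-defined functions and automatic parallelization among the things the framework provides. As a rough illustration of what those two user-defined functions might look like, here is a single-process Python sketch of the word count that the deck uses as its "Hello MapReduce" example. It is not Hadoop code: the names `map_words`, `shuffle`, `reduce_counts` and `word_count`, and the sample sentences, are made up for this sketch, and the grouping step that a real cluster distributes across nodes is simulated in memory.

```python
from collections import defaultdict


def map_words(document):
    """Mapper (hypothetical name): emit a (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reducer (hypothetical name): sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over all documents."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["the quick cow", "the cow and the moon"]  # made-up input
    print(word_count(docs))
```

Because each mapper call only sees its own document and each reducer call only sees one word, both steps can in principle run on different machines, which is the property the scale-out architecture relies on.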
  • Transcript of "The Rise of Big Data Science"

    1. The Rise of Big Data Science GILAD BARKAN
    2. Big Data Science: Big Data + Data Science = Big Data Science
    3. Big Data • Why? • What? • How?
    4. Big Data • Why? • What? • How?
    5. Why Big Data? • It’s the flooded information era we live in • In a world where data is power, big data is big power
    6. Why Big Data? • Web 2.0
    7. Why should we care about Big Data? • The big business opportunities • Competitive fast moving marketplace • Capitalize on business opportunities before everyone else • Existing channels to every person on the planet • Maximizing revenues from customers • Segment-of-1 - more personal customer experiences
    8. Big Data • Why? • What? • How?
    9. What is Big Data? • The 3 V’s: Volume, Variety, Velocity
    10. What is Big Data? • The 3 V’s: Volume, Variety, Velocity
    11. Big Data - Volume
    12. Big Data - Volume • Big Users: More Users, All the Time • Global Online Population: 2 Billion • Hours Spent Online: 35 Billion Hours • Smartphone Users: 1+ Billion
    13. More Users + More Data = Big Data
    14. What is Big Data? • The 3 V’s: Volume, Variety, Velocity
    15. Big Data - Variety • Trillions of Gigabytes (Zettabytes) • Heterogeneous sources of data • Structured data: tables • Unstructured / semi-structured data: audio, images, text, video (Text, Log Files, Click Streams, Blogs, Tweets, Audio, Video, etc.) • Sizes: 700 MB / movie, 5000 KB / song, 1000 KB / image, 5 KB / record, 50 KB / record • Traditional Structured: SQL • Unstructured: NoSQL
    16. What is Big Data? • The 3 V’s: Volume, Variety, Velocity
    17. Big Data - Velocity • How the hell does Google return an answer in 0.28 seconds by looking at 4 Billion pages?
    18. Big Data - Velocity • Online Advertisement - Real Time Bidding (RTB)
    19. Big Data - Velocity • Recommendations
    20. Big Data • Why? • What? • How?
    21. How is Big Data Handled? • The challenge is huge • Store, analyze and serve a huge volume of a variety of data at high velocity • We can’t achieve this using a single machine, no matter how strong it is. Why? • Expensive – stay tuned • Load balancing requests: Outbrain serves 3,000 per second; DG (MediaMind) serves 500K per second!!! • Not fault tolerant
    22. The Big Data Paradigm Shifts • Volume: Distributing the Data • Scale Up (Vertical): SQL Server • Scale Out (Horizontal): Hadoop Cluster, HDFS (GFS), Nodes
    23. Big Data – Reducing Costs • Hadoop is a 5 times cheaper infrastructure !!! • TCO (purchase + maintenance) for 3 years per 300 TB: DBMS server = 5 M$; 75-node cluster = 1 M$
    24. Big Data Paradigm Shift - Computing • MapReduce Computing Paradigm • Exploiting the distributed architecture for large scale computations in parallel
    25. MapReduce • “Hello MapReduce” – counting words [Figure: mappers on a Hadoop cluster each emit a word (W) / count (C) table for "the", "Cow" and "quick"; the master’s reducer sums them into the totals: the 21, Cow 2, quick 5]
    26. Big Data Paradigm Shift – NoSQL • Variety • Schema-less databases to support the variety of data • Complex SQL queries (joins, etc.) in a distributed data framework are extremely inefficient • Key-Value Store (NoSQL): Key: any (not a single primary key as in SQL tables), e.g. user_id, url, image_id, video_id; Value: any, e.g. text, images, video
    27. Big Data Paradigm Shift – Velocity • RAM-based DBs instead of traditional disk-based DBs • Store critical data in memory (much more expensive) • If the data doesn't come to Alg - Alg will come to the data [Figure: traditional vs. today; the algorithm (Alg) reading from and writing to Data]
    28. Big Data - Summary
    29. Big Data - Summary • BIG business opportunities • The 3 V’s: Volume, Variety, Velocity • Technological paradigm shifts
    30. Big Data Technological Paradigm Shifts • Volume: Scale Up vs. Scale Out • Map (Mappers) and Reduce (Master, Reducer) • Variety: NoSQL (Key, Value) • Velocity: Alg, Data
    31. Big Data - Summary • BIG business opportunities • The 3 V’s: Volume, Variety, Velocity • Computing and DB paradigm shifts • Flood of new (open source) technologies
    32. Flood of New Big Data Technologies • Open Source
    33. Big Data - Summary • BIG business opportunities • The 3 V’s: Volume, Variety, Velocity • Computing and DB paradigm shifts • Flood of new (open source) technologies • It’s definitely not just a buzz
    34. Big Buzz?
    35. Big Data - Summary • BIG business opportunities • The 3 V’s: Volume, Variety, Velocity • Computing and DB paradigm shifts • Flood of new (open source) technologies • It’s definitely not just a buzz • It’s a real response to the world’s hectic pace of evolution • Reducing costs by an order of magnitude • Still, it doesn’t mean every business today will / should transform its technology stack to support big data
    36. Big Data Science: Big Data + Data Science = Big Data Science
    37. Data Science • Why? • What? • How?
    38. Data Science • Why? • What? • How?
    39. Why Data Science? • data scientists
    40. Data is a real value • Facebook acquires Onavo for ~150M$
    41. Data Science • Why? • What? • How?
    42. Welcome to the Intelligent World • Data Analysis • Data Mining • Data Analytics • Data Science • Automatic Decisioning • Machine Learning • Predictive Analytics
    43. Data Miners are the New Gold Miners
    44. Search
    45. Online Advertisement - Real Time Bidding (RTB)
    46. Recommendations
    47. Text Analysis
    48. CRM – Customers Churn Prediction
    49. Time Series Analysis
    50. Machine Learning • Classification • Clustering • Regression • Recommendation
    51. Classification • Amdocs Insight™ - why is the customer calling the Call Center? • Pay Bill • Third Party Charges • Bill too high • Overage • Abnormal fee
    52. Clustering • Market Segmentation • Social Network Analysis
    53. Regression • Housing price prediction [Figure: Price ($, in 1000’s) vs. Size in m2]
    54. The Data Scientist
    55. Data Scientist Skillset • Hands-on with tools, languages, technologies • MSc / PhD in Math, CS, Stats, Physics • Hands-on with the specific problem domain
    56. Data Science ≠ BI • Apply advanced statistical machine learning algorithms to: dig deeper to find patterns that traditional BI tools may not reveal; much wider domains / applications spectrum • Predictive Analytics ≠ Exploratory Analytics
    57. Predictive Analytics (Data Science, Big Data Science) vs. Exploratory Analytics (Business Intelligence, Traditional BI)
    58. Academia Response to Data Science
    59. Data Science • Why? • What? • How?
    60. The Art of Data Science • We need at least one semester course for it • Still…
    61. Data Science Life Cycle [Figure: cycle spanning Offline Data Analysis and Run Time, with stages Business Goal, Understand Data, Prepare Data, Model, Evaluate, Deploy, Monitor]
    62. Closing the Loop • Technically, what do you think? • Is Big Data good or bad for Data Science? • Big Data + Data Science = Big Data Science
    63. The Bad - Finding a Needle in a Haystack • It’s the same treasure that hides – the problem is that the pile is now huge • Big Data → Big Noise
    64. The Bad - Finding a Needle in a Haystack • It’s the same treasure that hides – the problem is that the pile is now huge • Big Data → Big Noise
    65. The Good - The Statistical View • Statistics is predictive analytics’ fuel! • The more data you have (Big Data) the better your predictive models will perform
    66. Law of Large Numbers
    67. Law of Large Numbers
    68. Law of Large Numbers
    69. Law of Large Numbers
    70. Law of Large Numbers
    71. Law of Large Numbers
    72. Combining the Good & Bad • Data is a function of quality and quantity [Figure: Quality (Low to High) vs. Quantity (Small to Big)]
    73. Big Data Science - Summary • Big Data → Big Numbers → Big Opportunities • Big Data is the buzziest technology nowadays • Data Scientists: the ones that coax the treasures out of the big data for their companies; are multi-discipline skilled; the new industry rock stars
    74. Thank You for your attention
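
The "Statistical View" and "Law of Large Numbers" slides in the transcript argue that more data helps. The transcript does not preserve what those slides plotted, so the following is only a minimal Python sketch of the law itself: the average of many independent samples tends toward the true value as the sample count grows. The function name `estimate_error`, the true rate of 0.3 and the fixed seed are assumptions made for this illustration, and the sketch shows convergence of a simple average, not a guarantee that any particular predictive model improves with more data.

```python
import random


def estimate_error(n_samples, true_rate=0.3, seed=42):
    """Absolute error of the sample mean of n_samples Bernoulli(true_rate) draws.

    true_rate and seed are illustrative assumptions, not values from the deck.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < true_rate for _ in range(n_samples))
    return abs(hits / n_samples - true_rate)


if __name__ == "__main__":
    # The error typically shrinks roughly like 1 / sqrt(n), though any
    # single run can fluctuate.
    for n in (10, 1_000, 100_000):
        print(f"n={n:>7}: error={estimate_error(n):.4f}")
```

This also hints at the "Bad" side of the argument: the error shrinks slowly, so a hundred times more data buys only about ten times more precision, and it does nothing for data that is noisy or biased to begin with.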