Haifa Big Data Meetup - Meeting 1
Introduction to Big Data
Organizer and lecturer – Nathan Krasney
Nathan Krasney 23/6/15 1
Introduction to Big Data
• Big Data use cases
• What is Big Data:
– Definitions
– Technologies
• Why is the future so bright for Big Data
Use Cases – A
• http://www.ted.com/playlists/56/making_sense_of_too_much_data
• In recent years a huge amount of data has been coming from users: blogs, web sites, forums, Facebook, YouTube, LinkedIn, …
• The data is mostly personal: posts, likes, profiles, …
• The data contains the personal preferences, geographic location, … of hundreds of millions of people, on a scale that did not exist a few years ago.
• This data can be processed with machine learning algorithms to infer very interesting personal characteristics of people
[Figure: Facebook active users per month, in millions]
What kind of information can we produce by processing data on the web?
• Political preferences
• Personal characteristics
• Age
• Gender
• Religion
• Intelligence
• Consumer preferences
Use Cases – A1
Example 1: Facebook likes
Recent research found the top 5 likes that indicate intelligent people.
For example, liking this page (shown on the original slide). But why?
In general, people tend to choose friends who are like them. For example, young people choose young people as friends, smart people choose smart people as friends, and so on.
It turns out that this particular page was liked by a group of intelligent people and spread virally on the web via the likes of their friends (who also have high intelligence).
But this conclusion could be reached only by having big data and being able to process it.
Use Cases – A2
Example 2 – Forbes magazine
A company named Target started sending a particular family suggestions for baby clothing even before the daughter had told her parents she was pregnant. How did Target know about it?
• It turns out that the company (https://corporate.target.com/) has a huge database of the shopping done in its stores. Furthermore, the company has a smart algorithm that identifies pregnancy from the shopping a woman does at Target
• The algorithm even identifies the pregnancy due date
• The algorithm identified the girl's pregnancy not necessarily from baby products she bought, but from the vitamins she bought, a bigger handbag (for diapers) and other indirect characteristics
• The company's sales reached 71 billion $ in 2014 and the company has existed since 1902, so it has quite big data …
• The huge amount of data (big data) that Target has gathered about its customers and their purchases allowed the company to find behavioral patterns that indicate a coming pregnancy, based on purchases of items like vitamins, a bigger bag and so on
Use Cases – A3
Example 3
• Processing the huge amount of personal data that publicly exists on the web (Facebook, LinkedIn, forums, web sites, blogs, YouTube, Instagram, …) to predict a personal profile. This can help e.g. HR offices and companies hiring people
• Identifying the social group you belong to, using clustering, can further improve this predicted profile
• A better prediction of the user profile is worth more money
What is Big Data ?
• 3 V’s :
– Volume
– Velocity
– Variety
[Figure: the three V's from another angle]
• Data model: which fields of data will be stored and how (data types and any restrictions on the data input)
• Structured data: based on a data model, e.g. a relational database. Needs a schema
• Unstructured data: no data model, e.g. e-mails, PDF files, web pages, video, audio, photos. Schema-free. Suits NoSQL
• Batch: offline processing, e.g. by Hadoop
• Streaming: online (real-time) processing, e.g. by Spark
• Terabyte: 1,000 GB
• Zettabyte: 1,000,000,000 TB
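The unit sizes above can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal (SI) units where each step is a factor of 1,000, as the slide does; the helper names `to_bytes` and `convert` are made up for this illustration.

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value expressed in `unit` to bytes."""
    return value * 1000 ** UNITS.index(unit)

def convert(value, src, dst):
    """Convert a value from unit `src` to unit `dst`."""
    return to_bytes(value, src) / to_bytes(1, dst)

print(convert(1, "TB", "GB"))  # one terabyte expressed in gigabytes
print(convert(1, "ZB", "TB"))  # one zettabyte expressed in terabytes
```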
[Figure: the three V's from an additional angle]
Who's Generating Big Data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Batch use case – BlackBerry (good times stat…)
Data:
• Instrumentation data from devices
• 650 TB daily, 100 PB total
Processing is used for business analytics, e.g. viewing graphs
Batch use case – CBS Interactive (an online content network for information and entertainment)
Data:
• 1 PB of content, click streams, web logs
• 1 PB events tracked daily
Processing is used for business analytics, e.g. to identify user patterns such as "high value" users in order to target content
Streaming use case – cyber security (fraud detection) by RSA
Machine learning may stop suspicious credit card transactions. E.g. an Israeli person buys a lot online; however, once he travels to China he might be blocked when making the same online purchase.
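To make the streaming idea concrete, here is a toy sketch of a per-event check. It is not RSA's method and not machine learning: it is a single hand-written rule, and the class name `LocationCheck`, the `min_history` threshold and the sample stream are all assumptions made for this illustration. Each transaction is judged as it arrives, using only what has been seen so far.

```python
from collections import Counter, defaultdict

class LocationCheck:
    """Toy streaming rule: flag a purchase from a country the card has never used."""

    def __init__(self, min_history=3):
        self.min_history = min_history      # hypothetical threshold
        self.seen = defaultdict(Counter)    # card id -> country -> count

    def process(self, card, country):
        """Handle one event and return True if it looks suspicious."""
        history = self.seen[card]
        total = sum(history.values())
        suspicious = total >= self.min_history and history[country] == 0
        if not suspicious:
            history[country] += 1           # learn only from accepted events
        return suspicious

check = LocationCheck()
stream = [("card-1", "IL")] * 4 + [("card-1", "CN")]   # made-up events
flags = [check.process(card, country) for card, country in stream]
print(flags)
```

A real system would learn such patterns from data and weigh many more signals (amount, merchant, time of day).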
So we have gathered a huge amount of data. Now what?
The problem – processing big data
Traditional large-scale computation used a strong computer (a supercomputer):
• faster processors
• more memory
But even this was not enough.
A better solution is a distributed system: use multiple machines for a single job.
But this also has its problems:
• programming complexity: keeping data and processes in sync
• finite bandwidth
• partial failures: e.g. one failing computer should not bring the whole system down
Modern systems have much more data:
• terabytes (1 terabyte = 1,000 gigabytes) a day
• petabytes (1 petabyte = 1,000 terabytes) in total
The approach of keeping data in one central place is not suitable for big data
The new approach – Apache Hadoop
A software framework for storing, processing and analyzing big data
• Distributed
• Scalable
• Fault tolerant
• Open source
• Ecosystem
Hadoop core components:
• HDFS (Hadoop Distributed File System): stores the data on the cluster
• MapReduce: processes the data on the cluster
HDFS basic concepts
• HDFS is a file system written in Java
• It sits on top of a native file system, e.g. that of Linux
• It stores massive amounts of data:
– scalable
– fault tolerant
– supports efficient processing with MapReduce
• A cluster may have hundreds or thousands of servers
How files are stored
• Data files are split into blocks and distributed to the data nodes (computers)
• Each block is replicated on multiple nodes (3 by default)
• The NameNode stores the metadata
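The storage scheme above can be sketched in a few lines. This is a simplified model, not HDFS code: the function name `place_blocks`, the node names and the round-robin placement are assumptions for illustration (real HDFS placement is rack-aware), and the 128 MB block size is just an illustrative value. The returned dictionary plays the role of the metadata the NameNode keeps.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into blocks and return {block index: [nodes holding a replica]}."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size_mb // block_size_mb)   # ceiling division
    metadata = {}
    for block in range(n_blocks):
        # Assumption: simple round-robin placement over the node list.
        metadata[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return metadata

nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical data nodes
metadata = place_blocks(300, nodes)                # a 300 MB file
for block, replicas in metadata.items():
    print(block, replicas)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block.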
[Figure: HDFS basic concepts]
[Figure: getting data in and out of HDFS]
MapReduce
MapReduce has 3 main phases:
Phase 1 – the Mapper
• Each Map task works (typically) on one HDFS block
• Map tasks run (typically) on the same node where the block is stored
Phase 2 – Shuffle & Sort
• Sorts and collects all intermediate data from all mappers
• Happens after all Map tasks are completed
Phase 3 – the Reducer
• Operates on the sorted and shuffled intermediate data (the output of the previous phase)
• Produces the final output
Example: counting words
[Figure: Phase 1 – the mapper maps the text]
[Figure: Phase 2 – Shuffle & Sort]
[Figure: Phase 3 – Reduce]
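The three word-count phases can be imitated in plain Python. This is a single-process sketch of the idea only; it does not use the Hadoop API, and the function names and the two input lines are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Phase 1: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Phase 2: group all values by key, then sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Phase 3: sum the counts collected for one word."""
    return key, sum(values)

lines = ["the cat sat on the mat", "the dog sat"]   # made-up input
intermediate = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle_and_sort(intermediate))
print(counts)
```

On a cluster, each call to `mapper` would run on the node that holds the corresponding block, and the reducers would each receive a subset of the keys.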
It is important to understand that:
• Map tasks run in parallel, which reduces computation time
• Map tasks run on the machines that contain the data, so there are no network traffic issues
• Reduce tasks also run in parallel
Core Hadoop concepts:
• Applications are written in high-level languages
• Nodes talk to each other as little as possible
• Data is distributed in advance
• Data is replicated for increased availability and reliability
• Hadoop is scalable and fault tolerant
Fault tolerance:
• Node failure is inevitable
• What happens in this case:
– the system continues to function
– the master re-assigns tasks to a different node
– data is replicated, so no data is lost
– a node that recovers rejoins the cluster automatically
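The re-assignment step can be sketched as follows. This is a toy model under stated assumptions, not Hadoop's scheduler: the function `reassign`, the "least-loaded node" policy and the task and node names are all hypothetical.

```python
def reassign(assignments, failed_node):
    """Return a new assignment with the failed node's tasks moved to surviving nodes."""
    survivors = {node: list(tasks) for node, tasks in assignments.items()
                 if node != failed_node}
    for task in assignments.get(failed_node, []):
        # Assumption: give each orphaned task to the currently least-loaded node.
        target = min(survivors, key=lambda node: len(survivors[node]))
        survivors[target].append(task)
    return survivors

assignments = {"node-a": ["t1", "t2"], "node-b": ["t3"], "node-c": ["t4"]}
after = reassign(assignments, "node-a")
print(after)
```

The tasks can be restarted elsewhere only because the data they read is replicated on other nodes.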
Scalability means:
• capacity grows linearly with the number of nodes added
• increased load results in a graceful decline in performance, not in failure
Hadoop ecosystem
• Querying data: Hive, Pig, Impala
• Data store: HBase (Bigtable-like, over HDFS)
• Getting data into HDFS: Flume
• Schedulers (e.g. for Hadoop MapReduce jobs, Pig jobs): Oozie
• Machine learning: Mahout
[Figure: who uses Hadoop]
Spark
The problem: MapReduce may be slow and does only batch processing
The solution – Spark
• Can do both batch and streaming
• Apache Spark processes data in memory, while Hadoop MapReduce persists data back to disk after a map or reduce action. Processing can be up to 100 times faster
NoSQL (Not only SQL)
The problem: storage and retrieval of unstructured data, typically a huge amount of it.
The solution:
• A NoSQL database
• The data structures used by NoSQL databases:
– key-value: the key is the identifier
– graph: nodes + edges to represent relationships
– document: data is stored as JSON documents (MongoDB, CouchDB, …)
– …
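The three data structures listed above can be mimicked with plain Python values. This is only a sketch of the shapes of the data, not the API of any NoSQL product; the keys, names and the helper `neighbours` are made up for illustration.

```python
import json

# Key-value: the key is the identifier; the value is opaque to the store.
kv_store = {"user:42": b"serialized-profile"}

# Graph: nodes plus edges that represent relationships.
graph = {
    "nodes": {"alice", "bob", "carol"},
    "edges": {("alice", "friend_of", "bob"), ("bob", "friend_of", "carol")},
}

def neighbours(g, node, relation):
    """Return the nodes reachable from `node` over edges labelled `relation`."""
    return sorted(dst for src, rel, dst in g["edges"] if src == node and rel == relation)

# Document: a self-describing JSON document; fields may differ between documents.
document = json.loads('{"_id": 42, "name": "Alice", "likes": ["hiking", "jazz"]}')

print(kv_store["user:42"])
print(neighbours(graph, "alice", "friend_of"))
print(document["likes"])
```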
Why is the future so bright for Big Data
• IoT (Internet of Things) will add a huge amount of data in the coming years
• The cloud makes it easy to store a lot of data
• More and more data is stored over time on the net, in companies, in institutions, …
• Data processing abilities improve over time (Hadoop, Spark)
• The ability to store huge amounts of data improves over time
• The ability to store more data, combined with better processing, leads to smarter information that can be retrieved from the data
• Smart information is power = money
Nathan Krasney 23/6/15 48

More Related Content

What's hot

What's hot (20)

Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
BigData
BigDataBigData
BigData
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 
Hadoop explained [e book]
Hadoop explained [e book]Hadoop explained [e book]
Hadoop explained [e book]
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Big Data Technologies
Introduction to Big Data TechnologiesIntroduction to Big Data Technologies
Introduction to Big Data Technologies
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Big Data
Big DataBig Data
Big Data
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 
Big Data Information Architecture PowerPoint Presentation Slide
Big Data Information Architecture PowerPoint Presentation SlideBig Data Information Architecture PowerPoint Presentation Slide
Big Data Information Architecture PowerPoint Presentation Slide
 
Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
 
10 Most Effective Big Data Technologies
10 Most Effective Big Data Technologies10 Most Effective Big Data Technologies
10 Most Effective Big Data Technologies
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentation
 
Big Data 101
Big Data 101Big Data 101
Big Data 101
 

Viewers also liked

Hadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera managerHadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera manager
Co-graph Inc.
 

Viewers also liked (20)

קורס אנדרואיד
קורס אנדרואידקורס אנדרואיד
קורס אנדרואיד
 
Hadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera managerHadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera manager
 
Marketing strategy
Marketing strategy Marketing strategy
Marketing strategy
 
CS Seminar- After Sale
CS Seminar- After SaleCS Seminar- After Sale
CS Seminar- After Sale
 
CS Seminar - Reports
CS Seminar - ReportsCS Seminar - Reports
CS Seminar - Reports
 
eBay@Cospace
eBay@Cospace eBay@Cospace
eBay@Cospace
 
Diamonds cheat sheet mail
Diamonds cheat sheet mailDiamonds cheat sheet mail
Diamonds cheat sheet mail
 
New standards 2016
New standards 2016New standards 2016
New standards 2016
 
CS Seminar - Seller Hub
CS Seminar - Seller HubCS Seminar - Seller Hub
CS Seminar - Seller Hub
 
רונן הורן, השיווק המודרני- הגדלת המכירות שלך באמצעות מדעי המוח
רונן הורן, השיווק המודרני- הגדלת המכירות שלך באמצעות מדעי המוחרונן הורן, השיווק המודרני- הגדלת המכירות שלך באמצעות מדעי המוח
רונן הורן, השיווק המודרני- הגדלת המכירות שלך באמצעות מדעי המוח
 
eBay -שיפור מערך המכירות שלכם- כל הטיפים על בנית תהליך מכירה נכון ב
eBay -שיפור מערך המכירות שלכם- כל הטיפים על בנית תהליך מכירה נכון בeBay -שיפור מערך המכירות שלכם- כל הטיפים על בנית תהליך מכירה נכון ב
eBay -שיפור מערך המכירות שלכם- כל הטיפים על בנית תהליך מכירה נכון ב
 
CS Seminar - Returns
CS Seminar - ReturnsCS Seminar - Returns
CS Seminar - Returns
 
Best practices july 2016
Best practices july 2016Best practices july 2016
Best practices july 2016
 
New standards 2016 (1)
New standards 2016 (1)New standards 2016 (1)
New standards 2016 (1)
 
Announcments 2016
Announcments 2016Announcments 2016
Announcments 2016
 
סוף עונה- הצגת כלים לקידום מכירות והוספת מבצעים
סוף עונה- הצגת כלים לקידום מכירות והוספת מבצעיםסוף עונה- הצגת כלים לקידום מכירות והוספת מבצעים
סוף עונה- הצגת כלים לקידום מכירות והוספת מבצעים
 
CS seminar - ISP
CS seminar - ISPCS seminar - ISP
CS seminar - ISP
 
Bset practices
Bset practicesBset practices
Bset practices
 
מעריכה מחדש את ביצוע המוכרים eBay נהלים חדשים להערכת מוכרים- למדו כיצד
מעריכה מחדש את ביצוע המוכרים eBay נהלים חדשים להערכת מוכרים- למדו כיצדמעריכה מחדש את ביצוע המוכרים eBay נהלים חדשים להערכת מוכרים- למדו כיצד
מעריכה מחדש את ביצוע המוכרים eBay נהלים חדשים להערכת מוכרים- למדו כיצד
 
אלעד גולדנברג: מהפכה בת 20. אלפי שנות מסחר השתנו בקליק
אלעד גולדנברג: מהפכה בת 20. אלפי שנות מסחר השתנו בקליקאלעד גולדנברג: מהפכה בת 20. אלפי שנות מסחר השתנו בקליק
אלעד גולדנברג: מהפכה בת 20. אלפי שנות מסחר השתנו בקליק
 

Similar to Introduction to big data

Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
Thinkful
 

Similar to Introduction to big data (20)

Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentation
 
L21 Big Data and Analytics
L21 Big Data and AnalyticsL21 Big Data and Analytics
L21 Big Data and Analytics
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
Bigdata and Hadoop with applications
Bigdata and Hadoop with applicationsBigdata and Hadoop with applications
Bigdata and Hadoop with applications
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
 
Big data
Big dataBig data
Big data
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
DBMS
DBMSDBMS
DBMS
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltools
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
 

More from Nathan Krasney

More from Nathan Krasney (18)

Introduction to Semantic ui-react
Introduction to Semantic ui-reactIntroduction to Semantic ui-react
Introduction to Semantic ui-react
 
React introduction
React introductionReact introduction
React introduction
 
Angular 2 jump start
Angular 2 jump startAngular 2 jump start
Angular 2 jump start
 
Angular 2 introduction
Angular 2 introductionAngular 2 introduction
Angular 2 introduction
 
Angular 2 - Typescript
Angular 2  - TypescriptAngular 2  - Typescript
Angular 2 - Typescript
 
Angular 2 binding
Angular 2  bindingAngular 2  binding
Angular 2 binding
 
ADO.Net
ADO.NetADO.Net
ADO.Net
 
JQuery
JQueryJQuery
JQuery
 
ASP.net Security
ASP.net SecurityASP.net Security
ASP.net Security
 
ASP.net Web Pages
ASP.net Web PagesASP.net Web Pages
ASP.net Web Pages
 
ASP.net MVC
ASP.net MVCASP.net MVC
ASP.net MVC
 
CSS
CSSCSS
CSS
 
Javascript with json
Javascript with jsonJavascript with json
Javascript with json
 
javascript
javascriptjavascript
javascript
 
Javascript ajax
Javascript ajaxJavascript ajax
Javascript ajax
 
HTML5
HTML5 HTML5
HTML5
 
HTML
HTML HTML
HTML
 
Lessons learned from 6 month project with india based software house
Lessons learned from 6 month project with india based software houseLessons learned from 6 month project with india based software house
Lessons learned from 6 month project with india based software house
 

Recently uploaded

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 

Recently uploaded (20)

Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 

Introduction to big data

  • 1. Haifa Big Data Meetup - Meeting 1 Introduction to Big Data Organizer + Lecture – Nathan Krasney Nathan Krasney 23/6/15 1
  • 2. Introduction to Big Data • Big Data use cases • What is Big Data: – Definitions – Technologies • Why is the future so bright for Big Data Nathan Krasney 23/6/15 2
  • 3. Use Cases – A • http://www.ted.com/playlists/56/making_sense_ of_too_much_data • We have in recent years huge amount of data coming from users : Blogs, Web Sites, Forums ,Facebook , YouTube, LinkedIn,… • Data is mostly personal : post, like , profile, … • Data contains personal preferences , geographic location, …. of hundreds million of people in a scale that did not exist few years ago. • It is possible to process this data using Machine Learning algorithm to get very interesting personal characteristics of people Nathan Krasney 23/6/15 3
  • 4. Use Cases – A con’d Nathan Krasney 23/6/15 4 Facebook Active Users Per Month [in millions]
  • 5. Use Cases – A con’d What kind of info can we produce by processing data on the web ? • Political preferences • Personal characteristics • Age • Gender • Religious • Intelligence • Consumer preferences Nathan Krasney 23/6/15 5
  • 6. Use Cases – A1 Example 1 : facebook likes A research conducted lately has found the top 5 likes which indicated intelligent people For example clicking on this page. But why ? Nathan Krasney 23/6/15 6
  • 7. Use Cases – A1 con’d in general ,people tends to choose their friend to be like them. For example , young people will choose young people as their friends, smart people will choose smart people as their friends and so on. It turns out that this particular page was liked by a group of intelligent people and it spread on the web virally via the likes of their friends (who also have high intelligence). But this could be concluded only by having big data and being able to process it to come out with this conclusion. Nathan Krasney 23/6/15 7
  • 8. Use Cases – A2 Nathan Krasney 23/6/15 8 Example 2 - Forbes magazine a company name Target started to send particular family suggestions for baby clothing even before the daughter has told her parents she is pregnant. How did Target know about it ?
  • 9. Use Cases – A2 con’d • It turns out that the company - https://corporate.target.com/ has huge data base of shopping done on their stores. Furthermore, the company has smart algorithm that identify pregnancy given the shopping a woman does at Target • The algorithm identify the pregnancy due date !!! • The algorithm has identified the girl pregnancy not necessarily given baby products bought but by vitamins she bought and bigger hand bag (for dippers) and other indirect characteristics • Sales of the company in 2014 have reached 71 billion $ and the company exist from 1902 so she quite big data … Nathan Krasney 23/6/15 9
  • 10. Use Cases – A2 con’d • The huge data – big data that Target has gathered about her customers and their purchases has allowed the company to get Behavioral Patterns that indicated coming pregnancy using purchase of items like vitamins , bigger bag and so on Nathan Krasney 23/6/15 10
  • 11. Use Cases – A3 Example 3 • Processing the huge amount of personal data that publically exist on the web : Facebook , LinkedIn , forums , web sites , blogs , YouTube, Instegram ,… to predict personal profile. This can help e.g. HR offices, Companies hiring people… • Identifying the social group you belong to using clustering can further improve this predicted profile • Better prediction of the user profile worth more money Nathan Krasney 23/6/15 11
  • 12. What is Big Data ? Nathan Krasney 23/6/15 12 • 3 V’s : – Volume – Velocity – Variety
  • 13. What is Big Data? (cont'd)
  • 14. What is Big Data? (cont'd)
  • 15. What is Big Data? (cont'd)
  • 16. What is Big Data? (cont'd) The three V's from a different angle
  • 17. What is Big Data? (cont'd) • Data model - which fields of data are stored and how: the data type and any restrictions on the data input • Structured data – based on a data model, e.g. a relational database. Needs a schema • Unstructured data – no data model, e.g. e-mails, PDF files, web pages, video, audio, photos. Schema-free. Suits NoSQL • Batch: offline processing, e.g. by Hadoop • Streaming: online (real-time) processing, e.g. by Spark • Terabyte – 1,000 GB • Zettabyte – 1,000,000,000 TB
  • 18. What is Big Data? (cont'd) The three V's from another angle
  • 19. What is Big Data? (cont'd) Who's Generating Big Data • Social media and networks (all of us are generating data) • Scientific instruments (collecting all sorts of data) • Mobile devices (tracking all objects all the time) • Sensor technology and networks (measuring all kinds of data) Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
  • 20. What is Big Data? (cont'd) Batch use case – Blackberry (good times stat…) Data: • Instrumentation data from devices • 650 TB daily, 100 PB total Processing is used for business analytics, e.g. viewing graphs
  • 21. What is Big Data? (cont'd) Batch use case – CBS Interactive (an online content network for information and entertainment) Data: • 1 PB of content, click streams, web logs • 1 PB events tracked daily Processing is used for business analytics, e.g. identifying user patterns such as "high value" users in order to target content
  • 22. What is Big Data? (cont'd) Streaming use case – Cyber security (fraud detection) by RSA. Machine learning can stop suspicious credit card transactions. E.g. an Israeli who buys a lot online may find that, once he travels to China, the same online purchase is blocked.
  • 23. What is Big Data? (cont'd) So we have gathered a huge amount of data, now what? The problem – processing big data. Traditional large-scale computation used a strong computer (a supercomputer): • faster processors • more memory
  • 24. What is Big Data? (cont'd) But even this was not enough. A better solution is a distributed system: use multiple machines for a single job. But this also has its problems: • programming complexity - keeping data and processes in sync • finite bandwidth • partial failures - e.g. the failure of one computer should not bring the whole system down
  • 25. What is Big Data? (cont'd) Modern systems have much more data: • terabytes (1,000 gigabytes) a day • petabytes (1,000 terabytes) total The approach of a central data store is not suitable for big data
  • 26. What is Big Data? (cont'd)
  • 27. What is Big Data? (cont'd) The new approach – Apache Hadoop: a software framework for storing, processing and analyzing big data • Distributed • Scalable • Fault tolerant • Open source • Ecosystem
  • 28. What is Big Data? (cont'd) The new approach – Hadoop. Hadoop core components: • HDFS (Hadoop Distributed File System) - stores the data on the cluster • MapReduce - processes the data on the cluster
  • 29. What is Big Data? (cont'd) HDFS basic concepts • HDFS is a file system written in Java • Sits on top of a native file system, e.g. that of Linux • Stores massive amounts of data: – scalable – fault tolerant – supports efficient processing with MapReduce
  • 30. What is Big Data? (cont'd) HDFS basic concepts: a cluster may have hundreds or thousands of servers
  • 31. What is Big Data? (cont'd) HDFS basic concepts: how files are stored • Data files are split into blocks and distributed to the data nodes (computers) • Each block is replicated on multiple nodes (3 is the default) • The NameNode stores the metadata
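The storage scheme on this slide (split into blocks, replicate each block, keep a metadata map) can be sketched in a few lines of Python. This is only a toy illustration, not how HDFS is implemented: the 128 MB block size, the node names, and the round-robin placement in the hypothetical `place_replicas` helper are assumptions made for the example, and real HDFS uses a more elaborate placement policy.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block of a file (last block may be smaller)."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, data_nodes, replication=3):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block_id: [data_nodes[(block_id + i) % len(data_nodes)]
                   for i in range(replication)]
        for block_id in range(num_blocks)
    }

# A 300 MB file with an assumed 128 MB block size and four hypothetical nodes.
blocks = split_into_blocks(300 * MB, 128 * MB)
nodes = ["node1", "node2", "node3", "node4"]
block_map = place_replicas(len(blocks), nodes)  # plays the role of NameNode metadata

for block_id, replicas in block_map.items():
    print(block_id, blocks[block_id] // MB, "MB", replicas)
```

Because every block lives on three different nodes, losing any single node in this sketch still leaves two copies of each block.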
  • 32. What is Big Data? (cont'd) HDFS basic concepts
  • 33. What is Big Data? (cont'd) Getting data in and out of HDFS
  • 34. What is Big Data? (cont'd) MapReduce has 3 main phases: Phase 1 - The Mapper • Each Map task (typically) works on one HDFS block • Map tasks (typically) run on the same node where the block is stored Phase 2 - Shuffle & Sort • sorts and collects all intermediate data from all mappers • completes only after all Map tasks have finished Phase 3 - The Reducer • operates on the sorted, shuffled intermediate data (the output of the previous phase) • produces the final output
  • 35. What is Big Data? (cont'd) Example: counting words
  • 36. What is Big Data? (cont'd) Phase 1 - The mapper maps the text
  • 37. What is Big Data? (cont'd) Phase 2 - Shuffle & Sort
  • 38. What is Big Data? (cont'd) Phase 3 – Reduce
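The three phases of the word-count example can be imitated on a single machine in plain Python. This is a minimal sketch of the idea only: it does not use the Hadoop API, everything runs sequentially in one process, and the function names and input lines are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Phase 1: emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle_and_sort(pairs):
    """Phase 2: group all values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Phase 3: sum the counts collected for one word."""
    return key, sum(values)

# Made-up input; on a cluster each line (or block) would go to a separate Map task.
lines = ["the cat sat", "the dog sat", "the cat ran"]

intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle_and_sort(intermediate)
counts = dict(reducer(word, ones) for word, ones in grouped)
print(counts)
```

On a real cluster the calls to `mapper` would run in parallel on the nodes holding the data, and the groups produced by the shuffle would be divided among several reducers.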
  • 39. What is Big Data? (cont'd) It is important to understand that: • Map tasks run in parallel, which reduces computation time • Map tasks (typically) run on the machines that contain the data, which keeps network traffic to a minimum • Reduce tasks also run in parallel
  • 40. What is Big Data? (cont'd) Core Hadoop concepts: • applications are written in high-level languages • nodes talk to each other as little as possible • data is distributed in advance • data is replicated for increased availability and reliability • Hadoop is scalable and fault tolerant
  • 41. What is Big Data? (cont'd) Fault tolerance: • node failure is inevitable • what happens in this case: – the system continues to function – the master re-assigns tasks to a different node – data is replicated, so no data is lost – a node which recovers rejoins the cluster automatically
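The "master re-assigns tasks to a different node" step can be pictured with a small Python sketch. It is a deliberately simplified illustration under stated assumptions: the `reassign_tasks` helper, the task and node names, and the least-loaded-node policy are all hypothetical, and Hadoop's actual scheduler is considerably more involved.

```python
def reassign_tasks(assignments, failed_node):
    """Return new assignments with the failed node's tasks moved to surviving nodes."""
    survivors = {node: list(tasks)
                 for node, tasks in assignments.items() if node != failed_node}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the tasks")
    for task in assignments.get(failed_node, []):
        # Toy policy: hand the task to the node with the fewest tasks so far.
        target = min(survivors, key=lambda node: len(survivors[node]))
        survivors[target].append(task)
    return survivors

# Hypothetical cluster state before node1 fails.
assignments = {
    "node1": ["map-0", "map-1"],
    "node2": ["map-2"],
    "node3": ["map-3"],
}
after_failure = reassign_tasks(assignments, "node1")
print(after_failure)
```

The re-assigned tasks can still read their input because, as the slide notes, each block is replicated on other nodes.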
  • 42. What is Big Data? (cont'd) Scalability means: • capacity grows roughly linearly with the number of nodes added • increased load results in a graceful decline in performance rather than failure
  • 43. What is Big Data? (cont'd) Hadoop Ecosystem
  • 44. What is Big Data? (cont'd) Hadoop Ecosystem • Querying data: Hive, Pig, Impala • Data store: HBase (a Bigtable-like store over HDFS) • Getting data into HDFS: Flume • Schedulers (e.g. for Hadoop MapReduce jobs, Pig jobs): Oozie • Machine learning: Mahout
  • 45. What is Big Data? (cont'd) Who uses Hadoop
  • 46. What is Big Data? (cont'd) Spark. The problem: MapReduce may be slow and does only batch processing. The solution – Spark • Can do both batch and streaming • Apache Spark processes data in memory, while Hadoop MapReduce persists data back to disk after a map or reduce action. Processing can be up to 100 times faster
  • 47. What is Big Data? (cont'd) NoSQL (Not only SQL). The problem: storage and retrieval of unstructured data, typically a huge amount of it. The solution: • A NoSQL database • The data structures used by NoSQL databases: – key-value: the key is the identifier – graph: nodes + edges to represent relationships – document: stores data as JSON documents (MongoDB, CouchDB, ...) – …
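The three data structures named on this slide can be mimicked with built-in Python types. This is only a rough sketch of the shapes of the data, using in-memory dictionaries rather than any real NoSQL database; the keys, names, and the `friends_of_friends` helper are invented for illustration.

```python
import json

# Key-value: the key is the identifier, the value is opaque to the store.
kv_store = {}
kv_store["user:42"] = "Dana"

# Document: a self-describing JSON document; no schema is declared up front.
document = {"_id": "user:42", "name": "Dana", "likes": ["hiking", "jazz"]}
serialized = json.dumps(document)   # what would be sent to a document store
restored = json.loads(serialized)   # what would come back from it

# Graph: nodes plus edges that represent relationships (here, friendships).
friendships = {
    "Dana": {"Omer"},
    "Omer": {"Dana", "Noa"},
    "Noa": {"Omer"},
}

def friends_of_friends(graph, person):
    """Return people two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}

print(kv_store["user:42"], restored["likes"], friends_of_friends(friendships, "Dana"))
```

A relationship query such as `friends_of_friends` is the kind of question a graph model answers naturally, while the key-value and document models favor lookups by identifier.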
  • 48. Why is the future so bright for Big Data • IoT (Internet of Things) will add a huge amount of data in the coming years • The cloud makes it easy to store a lot of data • More and more data is stored over time on the net, in companies, in institutions, … • Data processing abilities improve over time (Hadoop, Spark) • The ability to store huge amounts of data improves over time • The ability to store more data, combined with better processing, leads to smarter information that can be retrieved from the data • Smart information is power = money