This document summarizes an introduction to big data presentation. It defines big data as high volume, velocity, and variety of structured and unstructured data. It provides examples of how companies like Facebook and Target use big data analytics to gain insights into user preferences. The document also discusses technologies like Hadoop, Spark, and NoSQL that help process and analyze large datasets. Finally, it notes that the future is bright for big data due to growing data sources, improved processing abilities, and the ability to extract valuable insights from big data.
Introduction to big data
1. Haifa Big Data Meetup - Meeting 1
Introduction to Big Data
Organizer + Lecture – Nathan Krasney
Nathan Krasney 23/6/15 1
2. Introduction to Big Data
• Big Data use cases
• What is Big Data:
– Definitions
– Technologies
• Why is the future so bright for Big Data
3. Use Cases – A
• http://www.ted.com/playlists/56/making_sense_of_too_much_data
• In recent years we have accumulated a huge amount of data
coming from users: blogs, web sites, forums,
Facebook, YouTube, LinkedIn, …
• The data is mostly personal: posts, likes, profiles, …
• The data contains personal preferences, geographic
location, … of hundreds of millions of people, on a
scale that did not exist a few years ago.
• It is possible to process this data using machine
learning algorithms to infer very interesting
personal characteristics of people
4. Use Cases – A con’d
Facebook Active Users Per Month [in millions]
5. Use Cases – A con’d
What kind of info can we produce by processing
data on the web ?
• Political preferences
• Personal characteristics
• Age
• Gender
• Religion
• Intelligence
• Consumer preferences
6. Use Cases – A1
Example 1: Facebook likes
A recent study found the top 5
likes that indicate intelligent people.
For example, liking this page. But why?
7. Use Cases – A1 con’d
In general, people tend to choose friends who are like
them. For example, young people choose young
people as their friends, smart people choose smart
people as their friends, and so on.
It turns out that this particular page was liked by a group
of intelligent people and it spread virally on the web
via the likes of their friends (who also have high
intelligence).
But this conclusion could be reached only by having big data
and being able to process it.
8. Use Cases – A2
Example 2 - Forbes magazine
A company named Target started sending a
particular family suggestions for baby
clothing even before the daughter had told
her parents she was pregnant. How did Target
know about it?
9. Use Cases – A2 con’d
• It turns out that the company
(https://corporate.target.com/) has a huge database of the
shopping done in its stores. Furthermore, the company
has a smart algorithm that identifies pregnancy given the
shopping a woman does at Target
• The algorithm even identifies the pregnancy due date!
• The algorithm identified the girl's pregnancy not
necessarily from baby products she bought but from the
vitamins she bought, a bigger handbag (for diapers) and other
indirect characteristics
• Sales of the company in 2014 reached 71 billion $ and
the company has existed since 1902, so it has quite big data …
10. Use Cases – A2 con’d
• The huge data (big data) that Target has
gathered about its customers and their
purchases has allowed the company to find
behavioral patterns that indicate an upcoming
pregnancy, using purchases of items like
vitamins, a bigger bag and so on
11. Use Cases – A3
Example 3
• Processing the huge amount of personal data
that publicly exists on the web (Facebook,
LinkedIn, forums, web sites, blogs, YouTube,
Instagram, …) to predict a personal profile. This can
help e.g. HR offices and companies hiring people
• Identifying the social group you belong to using
clustering can further improve this predicted
profile
• A better prediction of the user profile is worth more
money
12. What is Big Data ?
• 3 V’s :
– Volume
– Velocity
– Variety
13. What is Big Data ? Con’d
14. What is Big Data ? Con’d
15. What is Big Data ? Con’d
16. What is Big Data ? Con’d
The three V's – from another angle
17. What is Big Data ? Con’d
• Data model – which fields of data are stored and how:
the data type and any restrictions on the data input
• Structured data – based on a data model, e.g. a relational
database. Needs a schema
• Unstructured data – no data model, e.g. e-mails, PDF
files, web pages, videos, audio, photos. Schema-free.
Suits NoSQL
• Batch: offline processing, e.g. by Hadoop
• Streaming: online (real-time) processing, e.g. by Spark
• Terabyte – 1,000 GB
• Zettabyte – 1,000,000,000 TB
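The difference between batch and streaming processing can be illustrated with a minimal Python sketch. It is not how Hadoop or Spark work internally; the function names and the sample readings are hypothetical and chosen only for illustration. The batch function needs the whole dataset before it starts, while the streaming function updates its result as each record arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch: the whole dataset is available before processing starts."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result every time a new record arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # running average after each new record


# Hypothetical sample data
readings = [10.0, 20.0, 30.0, 40.0]
batch_result = batch_average(readings)
stream_results = list(streaming_average(iter(readings)))
```

The batch call produces one answer at the end; the streaming call produces an up-to-date answer after every record, which is what makes real-time use cases possible.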
18. What is Big Data ? Con’d
The three V's – from an additional angle
19. What is Big Data ? Con’d
Who's Generating Big Data
• Social media and networks
(all of us are generating data)
• Scientific instruments
(collecting all sorts of data)
• Mobile devices
(tracking all objects all the time)
• Sensor technology and networks
(measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
20. What is Big Data ? Con’d
Batch use case – Blackberry (good times stat…)
Data :
• Instrumentation data from devices
• 650 TB daily, 100 PB total
Processing is used for business analytics e.g.
view graphs
21. What is Big Data ? Con’d
Batch use case – CBS Interactive (online content
network for information and entertainment.)
Data :
• 1 PB of content , click streams , web logs
• 1 PB events tracked daily
Processing is used for business analytics e.g. to
identify user patterns e.g. “high value” users
to target content
22. What is Big Data ? Con’d
Streaming use case – Cyber security (fraud
detection) by RSA
Machine learning may stop credit card
transactions which are suspicious. E.g. an
Israeli person buys a lot online; however, once
he travels to China he might be blocked for the
same online purchase.
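The travel example can be sketched in Python. Note that this is a simple hand-written rule standing in for a learned model, not RSA's actual method; the Transaction class, the is_suspicious function, the min_share threshold and the sample data are all hypothetical. The rule flags a purchase made from a country that is rare in the card's history.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    card_id: str
    country: str
    amount: float


def is_suspicious(history: List[Transaction], new_tx: Transaction,
                  min_share: float = 0.1) -> bool:
    """Flag a transaction from a country that is rare in the card's history."""
    if not history:
        return False  # assumption: with no history, let the transaction pass
    countries = Counter(tx.country for tx in history)
    share = countries[new_tx.country] / len(history)
    return share < min_share


# Hypothetical history: 20 online purchases made from Israel
history = [Transaction("card-1", "IL", 50.0) for _ in range(20)]
home_tx = Transaction("card-1", "IL", 60.0)
abroad_tx = Transaction("card-1", "CN", 60.0)

home_flag = is_suspicious(history, home_tx)
abroad_flag = is_suspicious(history, abroad_tx)
```

A real system would learn such patterns from many features and must decide per transaction as it arrives, which is why this is a streaming use case.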
23. What is Big Data ? Con’d
So we have gathered a huge amount of data, now
what?
The problem – processing big data
Traditional large-scale computation used
a strong computer (supercomputer):
• faster processors
• more memory
24. What is Big Data ? Con’d
But even this was not enough
A better solution is a distributed system:
use multiple machines for a single job.
But this also has its problems:
• programming complexity – keeping
data and processes in sync
• finite bandwidth
• partial failures – e.g. one computer
failing should not bring the system down
25. What is Big Data ? Con’d
Modern systems have much more data
• terabytes (1,000 gigabytes) a day
• petabytes (1,000 terabytes) total
The approach of a central data store is not
suitable for big data
26. What is Big Data ? Con’d
27. What is Big Data ? Con’d
The new approach – Apache Hadoop
A software framework for storing, processing and
analyzing big data
• Distributed
• Scalable
• Fault tolerant
• Open source
• Ecosystem
28. What is Big Data ? Con’d
The new approach – Hadoop
Hadoop core components :
• HDFS (Hadoop Distributed File System) – stores
the data on the cluster
• MapReduce – processes the data on the cluster
29. What is Big Data ? Con’d
HDFS basic concepts
• HDFS is a file system written in Java
• Sits on top of a native file system, e.g. on Linux
• Storage of massive amounts of data:
– scalable
– fault tolerant
– supports efficient processing with MapReduce
30. What is Big Data ? Con’d
HDFS basic concepts
A cluster may have hundreds or thousands of servers
31. What is Big Data ? Con’d
HDFS basic concepts
How files are stored
• Data files are split into blocks and distributed to
the data nodes (computers)
• Each block is replicated on multiple nodes (3 is
the default)
• The NameNode stores metadata
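The storage scheme above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HDFS implementation: the function names, node names and the tiny 4-byte block size are hypothetical, and the round-robin placement is a simplification (real HDFS uses much larger blocks and a rack-aware placement policy). The returned dictionary plays the role of the metadata kept by the NameNode.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's content into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 10-byte "file" and a 4-node cluster
data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
metadata = place_replicas(len(blocks), nodes)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.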
32. What is Big Data ? Con’d
HDFS basic concepts
33. What is Big Data ? Con’d
Getting data in and out of HDFS
34. What is Big Data ? Con’d
MapReduce
MapReduce has 3 main phases :
phase 1 - The Mapper
• Each Map task works (typically) on one HDFS block
• Map tasks run (typically) on the same node where the block is stored
phase 2 - Shuffle & Sort
• sorts and collects all intermediate data from all mappers
• happens after all Map tasks are completed
phase 3 - The Reducer
• operates on the sorted, shuffled intermediate data (the previous
phase's output)
• produces the final output
35. What is Big Data ? Con’d
Example : counting words
36. What is Big Data ? Con’d
Phase 1 - The mapper maps the text
37. What is Big Data ? Con’d
Phase 2 - Shuffle & Sort
38. What is Big Data ? Con’d
Phase 3 – Reduce
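The three phases of the word-count example can be sketched in plain Python. This runs on a single machine and only imitates the data flow; it does not use the Hadoop API, and the function names and the two input lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Phase 1: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle_and_sort(
    pairs: Iterable[Tuple[str, int]]
) -> List[Tuple[str, List[int]]]:
    """Phase 2: group the values by key and sort the keys."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(word: str, counts: List[int]) -> Tuple[str, int]:
    """Phase 3: sum the counts collected for one word."""
    return (word, sum(counts))


# Hypothetical input: each line stands in for one HDFS block
lines = ["the cat sat on the mat", "the dog sat"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle_and_sort(intermediate)
word_counts = dict(reducer(word, counts) for word, counts in grouped)
```

In a real cluster each mapper call would run on the node holding its block, and reducers would each handle a subset of the keys.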
39. What is Big Data ? Con’d
It is important to understand that :
• Map tasks run in parallel - this reduces computation
time.
• Map tasks run on the machines that contain the
data, so there are no network traffic issues
• Reduce tasks also run in parallel
40. What is Big Data ? Con’d
Core Hadoop concepts :
• applications are written in high level languages
• nodes talk to each other as little as possible
• data is distributed in advance
• data is replicated for increased availability and
reliability
• Hadoop is scalable and fault tolerant
41. What is Big Data ? Con’d
Fault tolerance :
• node failure is inevitable
• what happens in this case :
– the system continues to function
– the master re-assigns tasks to a different node
– data is replicated, so there is no loss of data
– a node which recovers rejoins the cluster
automatically
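The re-assignment step can be sketched as follows. This is a toy model of the idea only, not Hadoop's scheduler: the function names, task names and node names are hypothetical, and round-robin assignment is an assumption made to keep the example short.

```python
from typing import Dict, List


def assign_tasks(tasks: List[str], nodes: List[str]) -> Dict[str, str]:
    """Initial assignment: spread the tasks over the nodes round-robin."""
    return {task: nodes[i % len(nodes)] for i, task in enumerate(tasks)}


def reassign_after_failure(assignment: Dict[str, str], failed_node: str,
                           live_nodes: List[str]) -> Dict[str, str]:
    """Move every task of the failed node to one of the live nodes."""
    if not live_nodes:
        raise ValueError("no live nodes left to take over the tasks")
    new_assignment = dict(assignment)
    orphaned = [t for t, node in assignment.items() if node == failed_node]
    for i, task in enumerate(orphaned):
        new_assignment[task] = live_nodes[i % len(live_nodes)]
    return new_assignment


# Hypothetical job with four map tasks on a two-node cluster
tasks = ["map-0", "map-1", "map-2", "map-3"]
assignment = assign_tasks(tasks, ["node1", "node2"])
recovered = reassign_after_failure(assignment, "node2", ["node1"])
```

Re-running a task elsewhere works only because the blocks it reads are replicated on other nodes.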
42. What is Big Data ? Con’d
Scalability means
• capacity grows linearly with the number of
nodes added
• increased load results in a graceful decline in
performance rather than failure
43. What is Big Data ? Con’d
Hadoop Ecosystem
44. What is Big Data ? Con’d
Hadoop Ecosystem
• Querying data: Hive, Pig, Impala
• Data store: HBase (a Bigtable-like store over HDFS)
• Getting data into HDFS: Flume
• Schedulers (e.g. for Hadoop MapReduce jobs, Pig
jobs): Oozie
• Machine learning: Mahout
45. What is Big Data ? Con’d
Who uses Hadoop
46. What is Big Data ? Con’d
Spark
The problem: MapReduce may be slow and does only
batch processing
Solution – Spark
• Can do both batch and streaming
• Apache Spark processes data in memory, while Hadoop
MapReduce persists data back to disk after a map or
reduce action. Up to 100x faster processing
47. What is Big Data ? Con’d
NoSQL (Not only SQL)
The problem: storage and retrieval
of unstructured data, typically a huge amount of it.
The solution:
• A NoSQL database
• The data structures used by NoSQL databases:
– Key-value: the key is the identifier
– Graph: nodes + edges to represent relationships
– Document: stores data as JSON documents (MongoDB,
CouchDB, ...)
– …
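The three data structures listed above can be illustrated with plain Python objects. This is only a sketch of the shapes of the data, not the API of MongoDB, CouchDB or any other NoSQL product; all keys, names and values are hypothetical.

```python
import json
from typing import Any, Dict, List, Set

# Key-value: the key is the identifier, the value is opaque to the store
kv_store: Dict[str, str] = {}
kv_store["user:42:name"] = "Dana"

# Graph: nodes plus edges, here as an adjacency list of "friend" edges
graph: Dict[str, List[str]] = {
    "dana": ["yossi"],
    "yossi": ["dana", "noa"],
    "noa": [],
}


def friends_of_friends(g: Dict[str, List[str]], person: str) -> Set[str]:
    """Follow two edges from `person`, excluding direct friends and self."""
    direct = set(g.get(person, []))
    result: Set[str] = set()
    for friend in direct:
        result.update(g.get(friend, []))
    return result - direct - {person}


# Document: a self-describing, schema-free record stored as JSON
document: Dict[str, Any] = {"_id": 42, "name": "Dana",
                            "likes": ["science", "music"]}
serialized = json.dumps(document)

suggestions = friends_of_friends(graph, "dana")
```

None of the three needs a schema declared in advance, which is what makes them a fit for unstructured data.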
48. Why is the future so bright for Big Data
• IoT (Internet of Things) will add a huge amount of data
in the coming years
• The cloud allows us to easily store a lot of data
• More data is stored as time goes by on the net,
in companies, institutions, …
• Data processing abilities improve as time goes by
(Hadoop, Spark)
• The ability to store huge amounts of data improves as
time goes by
• The ability to store more data + better processing leads
to smarter info that can be retrieved from the data
• Smart info is power = money