Big Data
Dipl. Inform.(FH) Jony Sugianto, M. Comp. Sc.
Hp:0838-98355491
WA:0812-13086659
Email:jonysugianto@gmail.com
Agenda
● What is Big Data?
● Analytics
● Big Data Platforms
● Questions
What is Big Data?
● The basic idea behind the phrase Big Data is that
everything we do increasingly leaves a digital trace (or
data), which we (and others) can use and analyse
● Big Data therefore refers to our ability to make use of these
ever-increasing volumes of data
● Big Data is not about the size of the data; it is about the
value within the data
Datafication of the world
● Activities
- Web Browser
- Credit Cards
- E-Commerce
● Conversations
- WhatsApp
- Email
- Twitter
● Photos/Videos
- Instagram
- YouTube
● Sensors
- GPS
● Etc...
Turning Big Data into Value
● Datafication of our world: Activity, Conversation, Sensors, Photo/Video, Etc...
● Analysing Big Data: Text Analytics, Sentiment Analysis, Movement Analytics, Face/Voice Recognition, Etc...
● Value
Web data
● Log Data (all users)
- Anonymous ID from cookie data
- Login ID (if it exists)
- Article ID
- Channel / Category
- Browser
- IP
- Etc...
● Registered User Data (10%)
- Login ID
- Name
- Age
- Gender
- Education
- Etc...
Valuable data
● User activeness
● User interest based on reading behaviour
● Personal profiles for all users
Compute
Why use the UA 2?
User Activeness and User Interest
How to update the User Activeness?
New_UA = w_history * UA_so_far + w_current * UA_per_week
w_history = 0.75
w_current = 0.25
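The update rule above is an exponentially weighted moving average over weekly activity. A minimal sketch in Python (function and variable names are illustrative, not from the slides):

```python
# Weekly update of the User Activeness (UA) score: history weighted
# 0.75, the current week's activity 0.25, per the slide.
W_HISTORY = 0.75
W_CURRENT = 0.25

def update_ua(ua_so_far: float, ua_this_week: float) -> float:
    """New_UA = w_history * UA_so_far + w_current * UA_per_week."""
    return W_HISTORY * ua_so_far + W_CURRENT * ua_this_week

# A user with an accumulated score of 40 who scores 80 this week:
print(update_ua(40.0, 80.0))  # 50.0
```

Because old weeks decay geometrically (by a factor of 0.75 each update), a user who stops reading sees their score fade gradually rather than drop to zero.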
Assigning personal profiles
Final data
How to define the similarity?
● Linear: |x1 - x2|
● Square: (x1 - x2)^2
● Exponential: 10^f(|x1 - x2|)
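The three measures can be written down directly. The slide leaves the exponential variant's inner function f unspecified, so the identity f(d) = d is assumed here purely for illustration:

```python
# Distance measures between two scalar features x1, x2, as on the slide.
def linear(x1: float, x2: float) -> float:
    return abs(x1 - x2)

def square(x1: float, x2: float) -> float:
    return (x1 - x2) ** 2

def exponential(x1: float, x2: float, f=lambda d: d) -> float:
    # f is an assumption: the slide writes 10^f(|x1-x2|) without defining f.
    return 10 ** f(abs(x1 - x2))

print(linear(3, 7))       # 4
print(square(3, 7))       # 16
print(exponential(3, 5))  # 100
```

The square and exponential variants penalise large differences much more heavily than the linear one, which matters when matching an anonymous user's behaviour against registered-user profiles.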
Complexity Analysis
● Assume 30,000,000 clicks a day
● A week: 210,000,000 clicks
● Size of a log entry: 1 KB
● Total size: 210,000,000,000 bytes = 210 GB
Complexity Analysis
● All users: 10,000,000
● Logged-in users: 1,000,000
● Comparisons per second per CPU: 1,000,000
● Total comparisons: 9,000,000,000,000
● Total time: 9,000,000 seconds = 104 days
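The arithmetic on both slides can be checked in a few lines. The 9 × 10^12 figure matches comparing each of the 9,000,000 anonymous users against all 1,000,000 logged-in users, which is the assumption made here:

```python
# Back-of-the-envelope numbers from the two Complexity Analysis slides.
clicks_per_day = 30_000_000
clicks_per_week = clicks_per_day * 7            # 210,000,000
log_entry_bytes = 1_000                         # 1 KB per log entry
total_bytes = clicks_per_week * log_entry_bytes
print(total_bytes)                              # 210,000,000,000 -> 210 GB

anonymous_users = 10_000_000 - 1_000_000        # all users minus logged-in
login_users = 1_000_000
comparisons = anonymous_users * login_users     # 9,000,000,000,000
cmp_per_sec_per_cpu = 1_000_000
seconds = comparisons // cmp_per_sec_per_cpu    # 9,000,000 s
print(seconds / 86_400)                         # ~104 days on one CPU
```

A single CPU taking 104 days is exactly the kind of workload that motivates the distributed platforms on the following slides.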
Big Data Platforms
What is the difference?
What is Hadoop?
● Hadoop:
an open-source framework that supports data-intensive
distributed applications, licensed under the Apache v2 license
● Goals:
- Abstract and facilitate the storage and processing of large
and/or rapidly growing data sets
- High scalability and availability
- Use commodity hardware
- Fault tolerance
- Move computation rather than data
Hadoop Components
● Hadoop Distributed File System (HDFS)
A distributed file system that provides high-throughput access to
application data
● Hadoop YARN
A framework for job scheduling and cluster resource
management
● Hadoop MapReduce
A YARN-based system for parallel processing of large data sets
What is Hive?
● Hive is a data warehouse infrastructure built on top of
Hadoop
● Hive stores data in HDFS
● Hive compiles SQL queries into MapReduce jobs
Example Hive Script
What is Pig?
● Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis
programs
● Pig generates and compiles MapReduce programs on the
fly
Example Pig Script
What is Spark?
● A fast and general-purpose cluster computing system
● 10x (on disk) to 100x (in memory) faster than Hadoop
MapReduce
● Provides high-level APIs in
- Scala
- Java
- Python
● Can be deployed through Apache Mesos, Apache Hadoop
via YARN, or Spark's standalone cluster manager
Resilient Distributed Datasets
● Written in Scala
● The fundamental unit of data in Spark
● A distributed collection of objects
● Resilient: able to recompute missing partitions (node failure)
● Distributed: split across multiple partitions
● Dataset: can contain any type; Scala/Java/Python objects or
user-defined objects
● Operations
- Transformations (map, filter, groupBy, ...)
- Actions (count, collect, save, ...)
Spark Example
// Spark word count
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val env = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = env.parallelize(data)
    // "\\s+" (not "s+") splits each line on whitespace
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
  }
}
What is Flink?
● Written in Java
● An open-source platform for distributed stream and batch
data processing
● Several APIs in Java/Scala/Python
- DataSet API: batch processing
- DataStream API: stream processing
- Table API: relational queries
Flink Example
// Flink word count (batch DataSet API)
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    // "\\s+" (not "s+") splits each line on whitespace
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.collect().foreach(println)
  }
}
Questions?
