Big Data
Dipl. Inform.(FH) Jony Sugianto, M. Comp. Sc.
Hp:0838-98355491
WA:0812-13086659
Email:jonysugianto@gmail.com
Agenda
● What is Big Data?
● Analytics
● Big Data Platforms
● Questions
What is Big Data?
● The basic idea behind the phrase Big Data is that
everything we do increasingly leaves a digital trace (or
data), which we (and others) can use and analyse
● Big Data therefore refers to our ability to make use of these
ever-increasing volumes of data
● Big Data is not about the size of the data; it is about the
value within the data
Datafication of the world
● Activities
- Web Browser
- Credit Cards
- E-Commerce
● Conversations
- WhatsApp
- Email
- Twitter
● Photos/Videos
- Instagram
- YouTube
● Sensors
- GPS
● Etc...
Turning Big Data into Value
● Datafication of our world: Activity, Conversation, Sensors, Photo/Video, Etc...
● Analysing Big Data: Text Analytics, Sentiment Analysis, Movement Analytics, Face/Voice Recognition, Etc...
● Value
Web data
● Log Data (all users)
- Anonymous ID from cookie data
- Login ID (if it exists)
- Article ID
- Channel / Category
- Browser
- IP
- Etc...
● Registered User Data (10%)
- Login ID
- Name
- Age
- Gender
- Education
- Etc...
Valuable data
● User activeness
● User interest based on reading behaviour
● Personal profiles for all users
Compute
Why use the UA 2?
User Activeness and User Interest
How to update the User Activeness?
New_UA = w_history * UA_so_far + w_current * UA_per_week
w_history = 0.75
w_current = 0.25
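The update rule above is an exponentially weighted moving average over weekly activity. A minimal sketch in Python (function and variable names are illustrative, not from the slides):

```python
# Weekly update of the User Activeness (UA) score: history weighted
# 0.75, the current week's activity 0.25, per the slide.
W_HISTORY = 0.75
W_CURRENT = 0.25

def update_ua(ua_so_far: float, ua_this_week: float) -> float:
    """New_UA = w_history * UA_so_far + w_current * UA_per_week."""
    return W_HISTORY * ua_so_far + W_CURRENT * ua_this_week

# A user with an accumulated score of 40 who scores 80 this week:
print(update_ua(40.0, 80.0))  # 50.0
```

Because old weeks decay geometrically (by a factor of 0.75 each update), a user who stops reading sees their score fade gradually rather than drop to zero.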
Assigning personal profiles
Final data
How to define the similarity?
● Linear: |x1 - x2|
● Square: (x1 - x2)^2
● Exponential: 10^f(|x1 - x2|)
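The three measures can be written down directly. The slide leaves the exponential variant's inner function f unspecified, so the identity f(d) = d is assumed here purely for illustration:

```python
# Distance measures between two scalar features x1, x2, as on the slide.
def linear(x1: float, x2: float) -> float:
    return abs(x1 - x2)

def square(x1: float, x2: float) -> float:
    return (x1 - x2) ** 2

def exponential(x1: float, x2: float, f=lambda d: d) -> float:
    # f is an assumption: the slide writes 10^f(|x1-x2|) without defining f.
    return 10 ** f(abs(x1 - x2))

print(linear(3, 7))       # 4
print(square(3, 7))       # 16
print(exponential(3, 5))  # 100
```

The square and exponential variants penalise large differences much more heavily than the linear one, which matters when matching an anonymous user's behaviour against registered-user profiles.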
Complexity Analysis
● Assume 30,000,000 clicks a day
● A week: 210,000,000 clicks
● Size of a log entry: 1 KB
● Total size: 210,000,000,000 bytes = 210 GB
Complexity Analysis
● All users: 10,000,000
● Logged-in users: 1,000,000
● Comparisons per second per CPU: 1,000,000
● Total comparisons: 9,000,000,000,000
● Total time: 9,000,000 seconds = 104 days
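The arithmetic on both slides can be checked in a few lines. The 9 × 10^12 figure matches comparing each of the 9,000,000 anonymous users against all 1,000,000 logged-in users, which is the assumption made here:

```python
# Back-of-the-envelope numbers from the two Complexity Analysis slides.
clicks_per_day = 30_000_000
clicks_per_week = clicks_per_day * 7            # 210,000,000
log_entry_bytes = 1_000                         # 1 KB per log entry
total_bytes = clicks_per_week * log_entry_bytes
print(total_bytes)                              # 210,000,000,000 -> 210 GB

anonymous_users = 10_000_000 - 1_000_000        # all users minus logged-in
login_users = 1_000_000
comparisons = anonymous_users * login_users     # 9,000,000,000,000
cmp_per_sec_per_cpu = 1_000_000
seconds = comparisons // cmp_per_sec_per_cpu    # 9,000,000 s
print(seconds / 86_400)                         # ~104 days on one CPU
```

A single CPU taking 104 days is exactly the kind of workload that motivates the distributed platforms on the following slides.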
Big Data Platforms
What is the difference?
What is Hadoop?
● Hadoop:
an open-source framework that supports data-intensive
distributed applications, licensed under the Apache v2 license
● Goals:
- Abstract and facilitate the storage and processing of large
and/or rapidly growing data sets
- High scalability and availability
- Use commodity hardware
- Fault tolerance
- Move computation rather than data
Hadoop Components
● Hadoop Distributed File System (HDFS)
A distributed file system that provides high-throughput access to
application data
● Hadoop YARN
A framework for job scheduling and cluster resource
management
● Hadoop MapReduce
A YARN-based system for parallel processing of large data sets
What is Hive?
● Hive is a data warehouse infrastructure built on top of
Hadoop
● Hive stores data in HDFS
● Hive compiles SQL queries into MapReduce jobs
Example Hive Script
What is Pig?
● Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis
programs
● Pig generates and compiles MapReduce programs on the
fly
Example Pig Script
What is Spark?
● A fast and general-purpose cluster computing system
● 10x (on disk) to 100x (in memory) faster than Hadoop
MapReduce
● Provides high-level APIs in
- Scala
- Java
- Python
● Can be deployed through Apache Mesos, Apache Hadoop
via YARN, or Spark's standalone cluster manager
Resilient Distributed Datasets
● Written in Scala
● The fundamental unit of data in Spark
● A distributed collection of objects
● Resilient: able to recompute missing partitions (node failure)
● Distributed: split across multiple partitions
● Dataset: can contain any type; Scala/Java/Python objects or
user-defined objects
● Operations
- Transformations (map, filter, groupBy, ...)
- Actions (count, collect, save, ...)
Spark Example
// Spark word count
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val env = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = env.parallelize(data)
    // "\\s+" (not "s+") splits each line on whitespace
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
  }
}
What is Flink?
● Written in Java
● An open-source platform for distributed stream and batch
data processing
● Several APIs in Java/Scala/Python
- DataSet API: batch processing
- DataStream API: stream processing
- Table API: relational queries
Flink Example
// Flink word count (batch DataSet API)
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    // "\\s+" (not "s+") splits each line on whitespace
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.collect().foreach(println)
  }
}
Questions?
