Big Data Processing – An
introduction (and how it’s
implemented at Detik)
Jony Sugianto
IT R&D Engineer at Detik.com
Saturday, November 1st, 2014
Agenda
What is Big Data?
Examples of Big Data
Big Data Elements
Big Data Ecosystems
Hadoop Overview
Other Big Data Tools
Big Data Processing at Detik
QA Session
What is Big Data?
Big Data refers to a collection of data sets so large and
complex that they are impractical to process with conventional
databases and tools.
Because of its size and associated numbers, Big Data is
hard to capture, store, search, share, analyze and visualize.
What is Big Data?
Big Data spans four dimensions[1]:
Volume: Enterprises are awash with ever-growing data easily
amassing terabytes—even petabytes—of information.
Velocity: Sometimes 2 minutes is too late. For time-sensitive
processes such as catching fraud, big data must be used as it
streams into your enterprise in order to maximize its value.
Variety: Big data is any type of data - structured and
unstructured data such as text, sensor data, audio, video, click
streams, log files and more.
Veracity: Establishing trust in big data presents a huge
challenge as the variety and number of sources grows.
Examples of Big Data
10,000 payment card transactions are made every
second around the world.
Walmart handles more than 1 million customer
transactions an hour.
340 million tweets are sent per day. That's nearly 4,000
tweets per second.
Facebook has more than 901 million active users
generating social interaction data.
Detik?
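The per-second figures quoted above follow from simple division. A minimal check, using only the daily and hourly totals stated on this slide (the variable names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds
SECONDS_PER_HOUR = 60 * 60        # 3,600 seconds

# Totals taken from the slide text.
tweets_per_day = 340_000_000
walmart_transactions_per_hour = 1_000_000

tweets_per_second = tweets_per_day / SECONDS_PER_DAY
walmart_per_second = walmart_transactions_per_hour / SECONDS_PER_HOUR

print(f"Tweets per second: {tweets_per_second:,.0f}")                 # 3,935
print(f"Walmart transactions per second: {walmart_per_second:,.0f}")  # 278
```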
How Big is Big Data?
The definition of “Big Data” varies greatly depending
upon which part of the “animal” you touch, and where your
interests lie
Big Data Elements
Big Data Ecosystems
Hadoop Overview
Apache™ Hadoop® is an open source software project
that enables the distributed processing of large data sets
across clusters of commodity servers.
It is designed to scale up from a single server to thousands
of machines, with a very high degree of fault tolerance.
Hadoop Overview: Components
NameNode: The master of HDFS that directs the slave
DataNode daemons to perform the low-level I/O tasks
DataNode: The slave of HDFS that performs the grunt work
of the distributed filesystem (reading and writing HDFS blocks to
actual files on the local file system)
Secondary NameNode: Assistant daemon for monitoring
the state of the cluster HDFS. It communicates with the
NameNode to take snapshots of the HDFS metadata
Hadoop Overview: Components
JobTracker: Determines the execution plan by
determining which files to process, assigning nodes to different
tasks, and monitoring all tasks as they are running
TaskTracker: Manages the execution of individual tasks on
each slave node
Hadoop Overview: MR Process
Hadoop Overview: Example
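The original example slide is an image and is not recoverable here. As a stand-in, the following is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count problem. It is plain Python and does not use the Hadoop API; the function names and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two short lines of text.
counts = word_count(["big data at detik", "big data tools"])
print(counts)
```

On a real cluster the map and reduce calls would run as separate tasks on different nodes, and the shuffle would move data between them over the network; the sketch only shows the data flow.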
Other Big Data Tools: Hive
Hive allows you to define a structure for your unstructured
big data, simplifying the process of performing analysis and
queries by introducing a familiar, SQL-like language called
HiveQL
Hive is for data analysts familiar with SQL who need to do
ad-hoc queries, summarization and data analysis on their
HDFS data
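To give a feel for the kind of ad-hoc summarization query described above, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive, so it only illustrates the SQL-style question being asked, not HiveQL itself or how Hive executes it on HDFS. The table name, columns, and rows are hypothetical.

```python
import sqlite3

# Hypothetical access-log rows: (timestamp, url, HTTP status).
rows = [
    ("2014-11-01T10:00", "/news/a", 200),
    ("2014-11-01T10:01", "/news/b", 200),
    ("2014-11-01T10:02", "/news/a", 200),
    ("2014-11-01T10:03", "/news/c", 404),
]

# An in-memory SQLite database stands in for a Hive table over log files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (ts TEXT, url TEXT, status INTEGER)")
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)

# Ad-hoc summarization: successful hits per URL, most visited first.
top_pages = conn.execute(
    """
    SELECT url, COUNT(*) AS hits
    FROM access_log
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC, url
    """
).fetchall()
conn.close()

print(top_pages)
```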
Other Big Data Tools: Pig
Pig is an extension of Hadoop that makes it simpler to
query large HDFS datasets
Pig was created at Yahoo! to make it easier to analyze
the data in your HDFS without the complexities of writing a
traditional MapReduce program
Pig is made up of two main components:
A high-level data-flow language called Pig Latin
A compiler that compiles and runs Pig Latin scripts
With Pig, you can develop MapReduce jobs with a few
lines of Pig Latin
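A Pig Latin script is typically written as a sequence of named steps, each transforming the output of the previous one. The sketch below imitates that step-by-step style in plain Python; it is not Pig Latin and does not run on Hadoop. The log format and sample lines are assumptions, and the comments only point to the Pig operators that play a similar role.

```python
from collections import Counter

# Hypothetical raw log lines: "<timestamp> <url> <status>".
raw_lines = [
    "2014-11-01T10:00:00 /news/a?ref=home 200",
    "2014-11-01T10:00:05 /news/b 200",
    "malformed line",
    "2014-11-01T10:00:09 /news/a 200",
    "2014-11-01T10:00:12 /news/c 500",
]

# Step 1 (similar in role to LOAD): split lines into fields, drop malformed ones.
records = [parts for parts in (line.split() for line in raw_lines) if len(parts) == 3]

# Step 2 (similar in role to FILTER): keep only successful requests.
ok = [(ts, url, status) for ts, url, status in records if status == "200"]

# Step 3 (similar in role to FOREACH ... GENERATE): strip the query string.
paths = [url.split("?", 1)[0] for _, url, _ in ok]

# Step 4 (similar in role to GROUP followed by COUNT): hits per cleaned path.
hits_per_path = Counter(paths)

print(hits_per_path.most_common())
```

Each intermediate name can be inspected on its own, which is what makes this style convenient for long ETL pipelines.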
Other Big Data Tools: Pig vs Hive
Pig and Hive work well together
Hive is a good choice:
when you want to query the data
when you need an answer to a specific question
if you are familiar with SQL
Pig is a good choice:
for ETL (Extract -> Transform -> Load)
preparing your data so that it is easier to analyze
when you have a long series of steps to perform
At Detik, we use both Pig and Hive together
Other Big Data Tools: FlumeNG
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large
amounts of log data
Big Data Processing at Detik
Most Popular
Generates the most popular articles within a 15-minute timespan
Employs weightings to balance computation
4 nodes (1 Master + 3 Slaves)
± 2 Gb / 15 mins
Hadoop is used to store and parse Internet log files
Only one Hadoop job for each execution
Akka is used to download Internet log files in parallel and
distribute workloads evenly across the slaves
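The slide does not say how the workload is balanced, so the following is only one plausible sketch: a greedy assignment that always gives the largest remaining file to the least-loaded worker, followed by a parallel fetch with a thread pool. It uses Python's standard library rather than Akka, and the file names, sizes, and the fetch function are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def balance(file_sizes, n_workers):
    """Greedy balancing: largest file first, onto the least-loaded worker."""
    # Heap entries are (current load, worker id, assigned file names).
    heap = [(0, worker, []) for worker in range(n_workers)]
    heapq.heapify(heap)
    by_size = sorted(file_sizes.items(), key=lambda item: item[1], reverse=True)
    for name, size in by_size:
        load, worker, assigned = heapq.heappop(heap)
        assigned.append(name)
        heapq.heappush(heap, (load + size, worker, assigned))
    return {worker: (load, assigned) for load, worker, assigned in heap}


def fetch(name):
    """Stand-in for downloading one log file; no network access happens here."""
    return f"fetched {name}"


# Hypothetical log files and sizes in MB (about 2 GB in total).
log_sizes_mb = {"log-a": 700, "log-b": 500, "log-c": 400, "log-d": 300, "log-e": 100}

# Three workers, matching the three slave nodes mentioned on the slide.
plan = balance(log_sizes_mb, n_workers=3)

# Download all files concurrently; pool.map keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    fetched = list(pool.map(fetch, log_sizes_mb))

for worker, (load, names) in sorted(plan.items()):
    print(worker, load, names)
```

With these sizes the three workers end up with 700, 600, and 700 MB, which is close to an even split.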
Big Data Processing at Detik
Detik Analytics:
Tracking information about web access (similar to Google
Analytics/Urchin)
Still in development phase
3 Nodes (1 Master + 2 Slaves)
Hadoop is used to store the input Internet log data and the
output Internet log data
Akka is used to manage work balance
Hive is used to generate intermediate tables for the calculation
process and to calculate some rudimentary metrics
Pig is used to calculate more complex metrics
Big Data Processing at Detik
Example Analytics Metric:
Exit Rate: For all pageviews to the page, the exit rate is the
percentage that were the last in the session.
Bounce Rate: For all sessions that start with the page, the bounce
rate is the percentage in which that page was the only one in the session.
The bounce rate calculation for a page is based only on visits
that start with that page.
Currently DetikForum has 7,756,010 processed
records
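The two metric definitions above can be sketched directly from a list of sessions. This is a minimal in-memory illustration, not the Hive or Pig jobs used at Detik; the function name and the sample sessions are assumptions, and rates are returned as fractions between 0 and 1 rather than percentages.

```python
from collections import Counter


def exit_and_bounce_rates(sessions):
    """Compute per-page exit rate and bounce rate.

    sessions: list of sessions, each an ordered list of page paths.
    """
    pageviews = Counter()   # all views of each page
    exits = Counter()       # times the page was last in a session
    entrances = Counter()   # times the page was first in a session
    bounces = Counter()     # times the page was the only one in a session

    for pages in sessions:
        if not pages:
            continue  # skip empty sessions
        pageviews.update(pages)
        exits[pages[-1]] += 1
        entrances[pages[0]] += 1
        if len(pages) == 1:
            bounces[pages[0]] += 1

    # Exit rate is relative to all pageviews of the page.
    exit_rate = {page: exits[page] / pageviews[page] for page in pageviews}
    # Bounce rate is relative only to sessions that start with the page.
    bounce_rate = {page: bounces[page] / entrances[page] for page in entrances}
    return exit_rate, bounce_rate


# Hypothetical sessions.
sessions = [
    ["/home", "/news/a"],
    ["/news/a"],
    ["/home"],
    ["/news/a", "/home", "/forum"],
]
exit_rate, bounce_rate = exit_and_bounce_rates(sessions)
print(exit_rate)
print(bounce_rate)
```

Note that "/forum" has an exit rate but no bounce rate in this sample, because no session starts with it.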
Big Data Processing at Detik
After less than 2 minutes of processing time…
Question & Answer?
Thank You!
References
[1] http://www-01.ibm.com/software/data/bigdata/
[2] http://www.sas.com/big-data/
