SlideShare a Scribd company logo
BITCOIN PRICE DETECTION
WITH RANDOM FOREST
CS 556 - BIG DATA ANALYSIS TERM PROJECT -
YAKUP GORUR - S017327
OUTLINE
• MOTIVATION
• CRYPTOCURRENCY DATA
• RANDOM FOREST
• RESULT
MOTIVATION
BITCOIN
CRYPTOCURRENCY DATA
• OPEN
• HIGH
• LOW
• CLOSE
• Volume (BTC)
• Volume (Currency)
• Weighted Price
CANDLE STICK CHART
CRYPTOCURRENCY DATA
• OPEN PRICE:
The open represents the first price traded
during the candlestick.
• HIGH PRICE:
The high is the highest price traded during
the candlestick
• LOW:
The low shows the lowest price traded
during the candlestick
• CLOSE:
The close is the last price traded during the
candlestick
Image via Wikipedia
The color of the candle body indicates whether the closing price was
higher than the opening price 
green bar, called an ‘up-bar’
 red bar, called a ‘down-bar’
CANDLE STICK CHART
CRYPTOCURRENCY DATA
• Volume (BTC):
Volume, in BTC traded in the stock
market during a given measurement
interval.
• Volume (Currency):
Volume, in USD, traded on stock market
during a given measurement interval.
• Weighted Price:
 Measure of the average price
CRYPTOCURRENCY DATA
HOW BIG
WEEKLY DATA
DAILY DATA
HOURLY DATA
PER MINUTE DATA
x 7
x 24
x 60
2009 - 2018 for BITCOIN
~468 weeks
~3,285 days
~78,840 hours
~4,730,400 minutes
EVERY DAY OR EVERY WEEK
EVERY DAY
EVERY HOUR
EVERY MINUTE
PREDICTION
REAL TIME
Number of Market
~ 27000
Number of Coins
~ 5000
~12,636,000 weeks
~88 M days
~2 B hours
~120 B minutes
~60 B weeks
~440 B days
~10 T hours
~ 600 T minutes
RANDOM FOREST
• classification and regression tasks
• decision tree, the basic building
block of a random forest
• fast to build
• Fully parallelizable
RANDOM FOREST
• Each tree in a Random Forest is
trained independently, multiple trees
can be trained in parallel.
• Random Forests use a different
subsample of the data to train each
tree. Instead of replicating data
explicitly, we save memory by using
a TreePoint structure which stores
the number of replicas of each
instance in each subsample.
RANDOM FOREST IN SPARK
1. Read train data from a CSV file.
2. For n Trees call n mappers
3. Create 90% subset of the training data with replacement
4. When mapper is running it takes these data
5. After receiving data, each mapper start to build tree and
produce prediction for test dataset.
6. Pass the test data and label as key and value to Reducer
7. Reducer counts the majority label according to key.
8. Write results to output file.
RANDOM FOREST ON MAPREDUCE
BASIC MAPREDUCE STRUCTURE
DATA REPRESENTATION
RESULT
Window Size=7
RESULT
DATA TRAIN/TEST
RESULT
DATA PREDICTION
RMSE: 180
RESULT
DATA PREDICTION
“past performance is not an indicator for future outcomes”.*Finance world:
THANK YOU
- Yakup Görür

More Related Content

What's hot

What's hot (8)

Big data solution capacity planning
Big data solution capacity planningBig data solution capacity planning
Big data solution capacity planning
 
Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)
 
Replication and rebuild in cStor
Replication and rebuild in cStorReplication and rebuild in cStor
Replication and rebuild in cStor
 
Heapsort 1
Heapsort 1Heapsort 1
Heapsort 1
 
Dynamo db and Cross Region Migration
Dynamo db and Cross Region MigrationDynamo db and Cross Region Migration
Dynamo db and Cross Region Migration
 
Demonstration
DemonstrationDemonstration
Demonstration
 
Introducing gluster filesystem by aditya
Introducing gluster filesystem by adityaIntroducing gluster filesystem by aditya
Introducing gluster filesystem by aditya
 
Etcd terraform by Alex Somesan
Etcd terraform by Alex SomesanEtcd terraform by Alex Somesan
Etcd terraform by Alex Somesan
 

Similar to Bitcoin Price Detection with Pyspark presentation

Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Boris Yen
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
Boris Yen
 
ECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPs
Jan Aerts
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorter
Manchor Ko
 

Similar to Bitcoin Price Detection with Pyspark presentation (20)

Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with Redshift
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012Introduce Apache Cassandra - JavaTwo Taiwan, 2012
Introduce Apache Cassandra - JavaTwo Taiwan, 2012
 
14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Learning Cassandra
Learning CassandraLearning Cassandra
Learning Cassandra
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
ECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPs
 
Game playing (tic tac-toe), andor graph
Game playing (tic tac-toe), andor graphGame playing (tic tac-toe), andor graph
Game playing (tic tac-toe), andor graph
 
Introduction to Bayesian phylogenetics and BEAST
Introduction to Bayesian phylogenetics and BEASTIntroduction to Bayesian phylogenetics and BEAST
Introduction to Bayesian phylogenetics and BEAST
 
Getting started with Cassandra 2.1
Getting started with Cassandra 2.1Getting started with Cassandra 2.1
Getting started with Cassandra 2.1
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorter
 

Recently uploaded

一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Domenico Conte
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 

Recently uploaded (20)

Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 

Bitcoin Price Detection with Pyspark presentation

  • 1. BITCOIN PRICE DETECTION WITH RANDOM FOREST CS 556 - BIG DATA ANALYSIS TERM PROJECT - YAKUP GORUR - S017327
  • 2. OUTLINE • MOTIVATION • CRYPTOCURRENCY DATA • RANDOM FOREST • RESULT
  • 4. BITCOIN CRYPTOCURRENCY DATA • OPEN • HIGH • LOW • CLOSE • Volume (BTC) • Volume (Currency) • Weighted Price
  • 5. CANDLE STICK CHART CRYPTOCURRENCY DATA • OPEN PRICE: The open represents the first price traded during the candlestick. • HIGH PRICE: The high is the highest price traded during the candlestick • LOW: The low shows the lowest price traded during the candlestick • CLOSE: The close is the last price traded during the candlestick Image via Wikipedia The color of the candle body indicates whether the closing price was higher than the opening price  green bar, called an ‘up-bar’  red bar, called a ‘down-bar’
  • 6. CANDLE STICK CHART CRYPTOCURRENCY DATA • Volume (BTC): Volume, in BTC traded in the stock market during a given measurement interval. • Volume (Currency): Volume, in USD, traded on stock market during a given measurement interval. • Weighted Price:  Measure of the average price
  • 7. CRYPTOCURRENCY DATA HOW BIG WEEKLY DATA DAILY DATA HOURLY DATA PER MINUTE DATA x 7 x 24 x 60 2009 - 2018 for BITCOIN ~468 weeks ~3,285 days ~78,840 hours ~4,730,400 minutes EVERY DAY OR EVERY WEEK EVERY DAY EVERY HOUR EVERY MINUTE PREDICTION REAL TIME Number of Market ~ 27000 Number of Coins ~ 5000 ~12,636,000 weeks ~88 M days ~2 B hours ~120 B minutes ~60 B weeks ~440 B days ~10 T hours ~ 600 T minutes
  • 9. • classification and regression tasks • decision tree, the basic building block of a random forest • fast to build • Fully parallelizable RANDOM FOREST
  • 10. • Each tree in a Random Forest is trained independently, multiple trees can be trained in parallel. • Random Forests use a different subsample of the data to train each tree. Instead of replicating data explicitly, we save memory by using a TreePoint structure which stores the number of replicas of each instance in each subsample. RANDOM FOREST IN SPARK
  • 11. 1. Read train data from a CSV file. 2. For n Trees call n mappers 3. Create 90% subset of the training data with replacement 4. When mapper is running it takes these data 5. After receiving data, each mapper start to build tree and produce prediction for test dataset. 6. Pass the test data and label as key and value to Reducer 7. Reducer counts the majority label according to key. 8. Write results to output file. RANDOM FOREST ON MAPREDUCE BASIC MAPREDUCE STRUCTURE
  • 15. RESULT DATA PREDICTION “past performance is not an indicator for future outcomes”.*Finance world:
  • 16. THANK YOU - Yakup Görür