BITCOIN PRICE DETECTION
WITH RANDOM FOREST
CS 556 - BIG DATA ANALYSIS TERM PROJECT -
YAKUP GORUR - S017327
OUTLINE
• MOTIVATION
• CRYPTOCURRENCY DATA
• RANDOM FOREST
• RESULT
MOTIVATION
BITCOIN
CRYPTOCURRENCY DATA
• OPEN
• HIGH
• LOW
• CLOSE
• Volume (BTC)
• Volume (Currency)
• Weighted Price
CANDLE STICK CHART
CRYPTOCURRENCY DATA
• OPEN PRICE:
The open represents the first price traded
during the candlestick.
• HIGH PRICE:
The high is the highest price traded during
the candlestick
• LOW:
The low shows the lowest price traded
during the candlestick
• CLOSE:
The close is the last price traded during the
candlestick
Image via Wikipedia
The color of the candle body indicates whether the closing price was
higher than the opening price 
green bar, called an ‘up-bar’
 red bar, called a ‘down-bar’
CANDLE STICK CHART
CRYPTOCURRENCY DATA
• Volume (BTC):
Volume, in BTC traded in the stock
market during a given measurement
interval.
• Volume (Currency):
Volume, in USD, traded on stock market
during a given measurement interval.
• Weighted Price:
 Measure of the average price
CRYPTOCURRENCY DATA
HOW BIG
WEEKLY DATA
DAILY DATA
HOURLY DATA
PER MINUTE DATA
x 7
x 24
x 60
2009 - 2018 for BITCOIN
~468 weeks
~3,285 days
~78,840 hours
~4,730,400 minutes
EVERY DAY OR EVERY WEEK
EVERY DAY
EVERY HOUR
EVERY MINUTE
PREDICTION
REAL TIME
Number of Market
~ 27000
Number of Coins
~ 5000
~12,636,000 weeks
~88 M days
~2 B hours
~120 B minutes
~60 B weeks
~440 B days
~10 T hours
~ 600 T minutes
RANDOM FOREST
• classification and regression tasks
• decision tree, the basic building
block of a random forest
• fast to build
• Fully parallelizable
RANDOM FOREST
• Each tree in a Random Forest is
trained independently, multiple trees
can be trained in parallel.
• Random Forests use a different
subsample of the data to train each
tree. Instead of replicating data
explicitly, we save memory by using
a TreePoint structure which stores
the number of replicas of each
instance in each subsample.
RANDOM FOREST IN SPARK
1. Read train data from a CSV file.
2. For n Trees call n mappers
3. Create 90% subset of the training data with replacement
4. When mapper is running it takes these data
5. After receiving data, each mapper start to build tree and
produce prediction for test dataset.
6. Pass the test data and label as key and value to Reducer
7. Reducer counts the majority label according to key.
8. Write results to output file.
RANDOM FOREST ON MAPREDUCE
BASIC MAPREDUCE STRUCTURE
DATA REPRESENTATION
RESULT
Window Size=7
RESULT
DATA TRAIN/TEST
RESULT
DATA PREDICTION
RMSE: 180
RESULT
DATA PREDICTION
“past performance is not an indicator for future outcomes”.*Finance world:
THANK YOU
- Yakup Görür

Bitcoin Price Detection with Pyspark presentation

  • 1.
    BITCOIN PRICE DETECTION WITHRANDOM FOREST CS 556 - BIG DATA ANALYSIS TERM PROJECT - YAKUP GORUR - S017327
  • 2.
    OUTLINE • MOTIVATION • CRYPTOCURRENCYDATA • RANDOM FOREST • RESULT
  • 3.
  • 4.
    BITCOIN CRYPTOCURRENCY DATA • OPEN •HIGH • LOW • CLOSE • Volume (BTC) • Volume (Currency) • Weighted Price
  • 5.
    CANDLE STICK CHART CRYPTOCURRENCYDATA • OPEN PRICE: The open represents the first price traded during the candlestick. • HIGH PRICE: The high is the highest price traded during the candlestick • LOW: The low shows the lowest price traded during the candlestick • CLOSE: The close is the last price traded during the candlestick Image via Wikipedia The color of the candle body indicates whether the closing price was higher than the opening price  green bar, called an ‘up-bar’  red bar, called a ‘down-bar’
  • 6.
    CANDLE STICK CHART CRYPTOCURRENCYDATA • Volume (BTC): Volume, in BTC traded in the stock market during a given measurement interval. • Volume (Currency): Volume, in USD, traded on stock market during a given measurement interval. • Weighted Price:  Measure of the average price
  • 7.
    CRYPTOCURRENCY DATA HOW BIG WEEKLYDATA DAILY DATA HOURLY DATA PER MINUTE DATA x 7 x 24 x 60 2009 - 2018 for BITCOIN ~468 weeks ~3,285 days ~78,840 hours ~4,730,400 minutes EVERY DAY OR EVERY WEEK EVERY DAY EVERY HOUR EVERY MINUTE PREDICTION REAL TIME Number of Market ~ 27000 Number of Coins ~ 5000 ~12,636,000 weeks ~88 M days ~2 B hours ~120 B minutes ~60 B weeks ~440 B days ~10 T hours ~ 600 T minutes
  • 8.
  • 9.
    • classification andregression tasks • decision tree, the basic building block of a random forest • fast to build • Fully parallelizable RANDOM FOREST
  • 10.
    • Each treein a Random Forest is trained independently, multiple trees can be trained in parallel. • Random Forests use a different subsample of the data to train each tree. Instead of replicating data explicitly, we save memory by using a TreePoint structure which stores the number of replicas of each instance in each subsample. RANDOM FOREST IN SPARK
  • 11.
    1. Read traindata from a CSV file. 2. For n Trees call n mappers 3. Create 90% subset of the training data with replacement 4. When mapper is running it takes these data 5. After receiving data, each mapper start to build tree and produce prediction for test dataset. 6. Pass the test data and label as key and value to Reducer 7. Reducer counts the majority label according to key. 8. Write results to output file. RANDOM FOREST ON MAPREDUCE BASIC MAPREDUCE STRUCTURE
  • 12.
  • 13.
  • 14.
  • 15.
    RESULT DATA PREDICTION “past performanceis not an indicator for future outcomes”.*Finance world:
  • 16.