The document is a presentation by Jongwook Woo from the High-Performance Information Computing Center (HiPIC) at California State University Los Angeles given on February 25, 2017 at the SWRC conference in San Diego, CA. It discusses big data trends with open platforms and provides information on Spark, Hadoop, open data, use cases, and the future of big data. Specifically, it summarizes Jongwook Woo's background and experience, describes what big data is and how Spark improves on Hadoop MapReduce, discusses how Spark can integrate with Hadoop ecosystems, and provides examples of analyzing local business data using Spark.
Big Data Trend with Open Platform
1. Jongwook Woo
HiPIC
CalState
LA
SWRC 2017
San Diego, CA
Feb 25 2017
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Big Data Trend with
Open Platform
2. Contents
Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend
3. Myself
Experience:
Since 2002: Professor at California State University, Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
4. Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, Search, and Analyze City Data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduce Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
5. Experience in Big Data
Collaboration
Council Member of IBM Spark Technology Center
City of Los Angeles for OpenHub and Open Data
Startup Companies in Los Angeles
External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
Grants
IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
7. Data Issues
Large-Scale data
Terabytes (10^12 bytes), Petabytes (10^15 bytes)
– Because of the web
– Sensor Data (IoT), Bioinformatics, Social Computing, Streaming data, smart phone, online game…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive
8. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with inexpensive computers
Own super computers
Published papers in 2003, 2004
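The map and reduce steps named on this slide can be sketched in plain Python. This is a single-process toy, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the two sample lines are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Emit one (word, 1) pair per word, as a word-count mapper would.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key; on a real cluster this step moves data between nodes.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input, for illustration only.
lines = ["big data on hadoop", "big data on spark"]
counts = word_count(lines)
print(counts)
```

Each call to `map_phase` is independent of the others, which is what lets a framework spread the work over many inexpensive machines.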
9. What is Hadoop?
Hadoop Founder:
o Doug Cutting
Apache Committer:
Lucene, Nutch, …
10. Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
[Figure: Super Computer: Cluster for Compute plus Cluster for Store; Hadoop: Cluster for Compute/Store]
11. Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive Super Computer
– More accessible to the public than traditional super computers
• You can store and process your applications
– In your university labs, small companies, research centers
12. Hadoop Cluster: Logical Diagram
[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); every cluster node runs an agent and Hadoop on top of HDFS, alongside Hive, ZooKeeper, and Impala]
13. Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
15. Alternative to Hadoop MapReduce
Limitation in MapReduce
Hard to program in Java
Only Map and Reduce
– Limited Parallelization
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 ~ 100 times faster than the network (N/W) and disk storage used by MapReduce
16. Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS
Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New Programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
17. Spark
RDDs, Transformations, and Actions
[Diagram: Spark Core at the base; on top of it Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), ML / MLlib (machine learning; RDD-based matrices), GraphX (graph), and SparkR]
18. Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Communicate with Spark workers
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–Development and Test
19. RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
Immutable
–RDD, DStream, SchemaRDD, PairRDD
Lineage
–History of the objects
–Automatically and efficiently re-compute lost data
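The lineage idea can be illustrated with a small plain-Python toy. It is not Spark's API: `ToyRDD`, `lose_cache`, and the sample numbers are hypothetical names and data chosen only to show that an immutable dataset which remembers how it was derived can rebuild lost in-memory data.

```python
class ToyRDD:
    """Toy model, not Spark: an immutable dataset that records its lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source   # base data, set only on the root dataset
        self._parent = parent   # dataset this one was derived from
        self._fn = fn           # per-element function applied to the parent
        self._cache = None      # in-memory copy, which may be lost

    def map(self, fn):
        # A transformation returns a new dataset; the parent is not modified.
        return ToyRDD(parent=self, fn=fn)

    def lineage_depth(self):
        # Count the derivation steps back to the root dataset.
        depth, node = 0, self
        while node._parent is not None:
            depth, node = depth + 1, node._parent
        return depth

    def collect(self):
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                # Recompute from the parent using the recorded function.
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return self._cache

    def lose_cache(self):
        # Simulate losing the in-memory data, for example on a failed node.
        self._cache = None


base = ToyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.collect()
plus_one.lose_cache()
doubled.lose_cache()
recovered = plus_one.collect()  # rebuilt by replaying the lineage
```

No replica of the lost data is needed here: replaying the recorded functions over the surviving source is enough.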
20. RDD and Data Frame Operations
Transformation
Define new RDDs and Data Frames from the current ones
–Lazy: not computed immediately
map(), filter(), join(), select(), groupBy()
Actions
Return values
count(), collect(), take(), save()
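The difference between lazy transformations and actions can be sketched without Spark. The class below is a plain-Python toy, and `LazyDataset`, `square`, and the `calls` log are hypothetical names: `map` and `filter` only record what to do, while `collect`, `count`, and `take` run the recorded steps.

```python
from itertools import islice

class LazyDataset:
    """Toy model, not Spark: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # tuple of (kind, function) pairs

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Push each element through the recorded steps, one at a time.
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []

def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x

ds = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # nothing has been computed so far
result = ds.collect()
calls_after_action = len(calls)
```

Because nothing runs until an action is called, a real engine gets the chance to look at the whole chain before executing it.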
21. Programming in Spark
Scala
Functional Programming
– The function is the fundamental unit of programming
• Inputs and outputs can be functions
No side effects
– No states
Python
Legacy, large Libraries
Java
R
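The functional style described for Scala can also be shown in Python, one of the other languages on this slide. This is a minimal sketch with hypothetical helper names (`scale`, `compose`): functions are passed in and returned as values, and no input is modified.

```python
from functools import reduce

def scale(values, factor):
    # Pure function: builds a new tuple and leaves the input untouched.
    return tuple(v * factor for v in values)

def compose(f, g):
    # Functions as input and output: take two functions, return a new one.
    return lambda x: g(f(x))

original = (1, 2, 3)
scaled = scale(original, 10)

inc_then_double = compose(lambda x: x + 1, lambda x: x * 2)
total = reduce(lambda a, b: a + b, map(inc_then_double, original), 0)
```

With no shared state to update, the same function can be applied to different parts of the data on different machines without coordination.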
23. Spark
Spark SQL
Querying using SQL, HiveQL
Data Frame
ML
Machine Learning on Data Frame, Pipelining
MLlib
– On RDD
– Sparse vector support, Decision trees, Linear/Logistic Regression,
PCA, SVM
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select DStream from streaming data
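Windowing over a stream of micro-batches can be sketched in plain Python. This is a toy, not the Spark Streaming API: `windowed` is a hypothetical helper, and the window length and slide interval are counted in batches here rather than in units of time.

```python
def windowed(batches, window_length, slide_interval):
    """Yield the merged contents of the last `window_length` micro-batches,
    moving forward `slide_interval` batches each time."""
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = []
        for batch in batches[end - window_length:end]:
            window.extend(batch)
        yield window

# Hypothetical stream: each inner list stands for one micro-batch.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
windows = list(windowed(batches, window_length=3, slide_interval=2))
sums = [sum(w) for w in windows]
```

Consecutive windows overlap whenever the slide interval is shorter than the window length, so an element can contribute to more than one result.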
24. Scheduling Process
RDD Objects
– rdd1.join(rdd2).groupBy(…).filter(…)
– Optimizer: build operator DAG
DAGScheduler (receives the DAG)
– split graph into stages of tasks
– submit each stage as ready
– agnostic to operators!
TaskScheduler (receives a TaskSet)
– launch tasks via cluster manager
– retry failed or straggling tasks
– doesn’t know about stages
– reports a failed stage back
Worker (threads, block manager; receives Tasks)
– execute tasks
– store and serve blocks
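The "split graph into stages" step can be illustrated with a deliberately simplified toy. It assumes a linear chain of operation names instead of a real dependency graph, and treats `join` and `groupBy` as the operations that need a shuffle. `WIDE_OPS` and `split_into_stages` are hypothetical names, not part of Spark.

```python
# Toy assumption: these operations need data from many partitions (a shuffle).
WIDE_OPS = {"join", "groupBy"}

def split_into_stages(ops):
    """Group a linear chain of operation names into stages.

    A wide operation starts a new stage; the narrow operations that
    follow it are pipelined into that same stage.
    """
    stages = []
    for op in ops:
        if not stages or op in WIDE_OPS:
            stages.append([op])
        else:
            stages[-1].append(op)
    return stages

pipeline = ["textFile", "map", "join", "groupBy", "filter"]
stages = split_into_stages(pipeline)
print(stages)
```

A real scheduler works from the dependencies between datasets rather than from operation names, but the outcome is similar: work that needs no shuffle is grouped into one stage.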
25. During Scheduling Process
https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
27. Spark
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
28. Spark Components
Your program:
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
.count()
...
[Diagram: the Spark Driver/Client (app master) holds the RDD graph, scheduler, block tracker, and shuffle tracker; it reaches the Spark worker(s) through the cluster manager; each worker has a block manager and task threads; data is read from HDFS, HBase, Amazon S3, Couchbase, Cassandra, …]
29. Spark with Hadoop YARN
ResourceManager (RM) Per Cluster
– Create Spark AM and allocate Containers for Spark AM
NodeManager (NM) Per Node
– Spark workers
ApplicationMaster (AM) Per Application
– Containers for Spark Executors
[Diagram: the Spark Client contacts the Resource Manager on the Master Node; Node Managers on the Slave Nodes host the Containers that run the Spark AM and the Spark Executors]
32. Open Platform
Open Source
Open Conference
Open Data
Public Data
33. Open Source
Hadoop
http://hadoop.apache.org/
Spark
http://spark.apache.org/
NoSQL
http://hbase.apache.org/
Search Engine
http://lucene.apache.org/solr/
34. Open Conference
Hadoop Summit
Live Streaming
–http://siliconangle.tv/hadoop-summit-2016/
Spark Summit
https://spark-summit.org/east-2017/
Live Streaming
–http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457
35. Open Data
USA government
Federal, State, City governments
Expose data to public
USA Business
Twitter, Yelp, …
Expose data to public with APIs
– Some restriction to download
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with Geo info
38. Industrial Collaboration
Cloudera visited to interview Jongwook Woo
39. Industrial Collaboration: IBM Bluemix at CalStateLA
40. Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Tableau, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
43. High Performance Information Computing Center
LOCAL BUSINESS DATA ANALYSIS
Using Local Business Data
From Yelp and Google Local
Grad Students at CalStateLA
Symposium, Feb 24 2017
Yashaswi Ananth
Ruchi Singh
Mahsa Tayer Farahani
44. High Performance Information Computing Center
REVIEW COUNT FOR BUSINESS TYPES
• Food
• Services
• Entertainment
• Shopping
• Medical
48. High Performance Information Computing Center
Top business
Top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities
49. High Performance Information Computing Center
Popular businesses within 5 miles of CalStateLA, USC, and UCLA
50. High Performance Information Computing Center
Historical Analysis Of
College Scorecard
CalStateLA Symposium
Feb 24 2017
Kunal Pritwani
Atinder Singh
Dharmesh Soni
Mounika Vallabhaneni
51. High Performance Information Computing Center
Specification of Data Set
Data was collected from https://www.kaggle.com/kaggle/college-scorecard
We have historical data of over 100,000 colleges in the US spanning over 14 years.
Data Size – 1.33 GB
File Format – CSV (Comma-Separated Values)
52. High Performance Information Computing Center
Mean Income
Medical College of Wisconsin: 250K
Upstate Medical University: 152.7K
CalTech: 103K
Washington and Lee University: 100K
53. High Performance Information Computing Center
Comparing Average Net Price of Two States (Annual Tuition)
UCLA: $13,817; CalStateLA: $4,370
Fashion Inst of Tech: $11.5K; CUNY: $5K
54. High Performance Information Computing Center
SAT Scores in Different Colleges
Math (Blue), Verbal (Orange), Mean Earning (Purple)
• CalTech: 800, 778.9, $98.7K
• MIT: 800, 764.4, $124.4K
• Harvard: 791, 795.6, $133K
• Princeton: 793, 791, $115.6K
• Yale: 788, 794.4, $97.8K
55. High Performance Information Computing Center
Comparing Average Undergraduates Receiving PELL GRANT
Universal Career Community College: 100% PELL grant scholarship
56. High Performance Information Computing Center
Average Undergraduates Receiving PELL GRANT in Each College
East Georgia State College: $2,854 Avg.
PELL grant: 97.285%
57. High Performance Information Computing Center
Alphago vs Lee using Twitter Data
Systems
Azure HDInsight Spark
8 Nodes
– 40 cores: 2.4GHz Intel Xeon
– Memory - Each Node: 28 GB
Data Source
Keyword ‘alphago’ from Twitter via Apache NiFi
Data Size
63,193 tweets
Real Time Data Collection period
03/12 – 03/17/2016
– No data collected on 03/13
59. High Performance Information Computing Center
Top 10 Countries
# of Tweets per Country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600
60. High Performance Information Computing Center
Top 10 Countries Sentiment
Positive Negative
61. High Performance Information Computing Center
Top 10 Countries
Most Tweeted Countries
All countries show more positive tweets
–Korea, Japan, USA
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
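The claim that all countries show more positive tweets can be checked against the three rows the slide lists. The snippet below is a minimal sketch, assuming "positive share" means positive / (positive + negative); the function name `positive_share` is illustrative and not from the talk, and the rows elided with "…" on the slide are not reconstructed.

```python
# Positive/negative tweet counts copied from the slide's table.
# Only the three rows shown on the slide are included.
counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of sentiment-labelled tweets that are positive (assumed definition)."""
    return positive / (positive + negative)


shares = {country: positive_share(p, n) for country, (p, n) in counts.items()}

# Print countries from most to least positive.
for country, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {share:.1%} positive")
```

Under this definition all three listed countries come out above 50% positive, with Japan the most positive by a wide margin.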
62. High Performance Information Computing Center
Daily Tweets in 03/12 – 03/17/2016
[Chart: daily tweet counts for Alphago vs Lee Sedol, 3/12/2016 to 3/17/2016; y-axis 0 to 50,000]
Game 3: Mar 12
Game 4: Mar 13 (Lee Se-Dol win)
Game 5: Mar 15
63. High Performance Information Computing Center
Ngram words
3 words in a row right after the Go champion's name, “sedol” and “se-dol”
3-grams Frequency
Again-to-win 1,187
Is-something-I’ll 369
Is-something-i 199
In-go-tournament 168
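A minimal sketch of how such 3-grams could be counted, assuming simple lowercase regex tokenization. The sample tweets, the `KEYWORDS` set, and the helper `trigrams_after_keyword` are hypothetical illustrations; the talk's actual processing ran on Spark and is not shown in the slides.

```python
import re
from collections import Counter

# Spellings of the champion's name used as anchors (from the slide).
KEYWORDS = {"sedol", "se-dol"}


def trigrams_after_keyword(text):
    """Yield the three tokens that directly follow each keyword occurrence."""
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    for i, token in enumerate(tokens):
        # Only emit when a full three-token window exists after the keyword.
        if token in KEYWORDS and i + 3 < len(tokens):
            yield "-".join(tokens[i + 1:i + 4])


# Hypothetical sample tweets, for illustration only.
tweets = [
    "Lee Sedol again to win against AlphaGo?",
    "Go champion Lee Se-dol again to win",
    "Sedol in Go tournament today",
    "AlphaGo beats Sedol",
]

frequency = Counter(
    trigram for tweet in tweets for trigram in trigrams_after_keyword(tweet)
)

for trigram, count in frequency.most_common():
    print(trigram, count)
```

Lowercasing merges variants such as "Sedol" and "sedol"; the slide's table appears to keep some case distinctions (for example "Is-something-I’ll" versus "Is-something-i"), so the original tokenization likely differed.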
64. High Performance Information Computing Center
Sentiment Map of Alphago
Positive
Negative
65. High Performance Information Computing Center
Sentiment Map of Lee Se-Dol vs Alphago
YouTube video: “alphago sentiment” by Google
The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
66. High Performance Information Computing Center
Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Dept. of Transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter
70. High Performance Information Computing Center
City Government: Crime Data Set
Open Data in City of Los Angeles
Crime Data Set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to last 10 years of data set
71. High Performance Information Computing Center
Crime Data
Los Angeles 2014
Total occurrences of each crime (pie chart)
Categories: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT
Slice shares: 2%, 8%, 9%, 12%, 17%, 19%, 33%
72. High Performance Information Computing Center
Total No. of Crimes in 2014
No. of Crimes per Month
Month  Crimes
1      19169
2      17384
3      19730
4      19413
5      20645
6      20494
7      21480
8      21280
9      21287
10     21669
11     19844
12     21355
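The monthly figures on this slide can be summarized in a few lines. This is a sketch under one assumption: the twelve values are taken to correspond to months 1 to 12 in the order they appear on the slide. The variable names are illustrative.

```python
# Monthly crime counts for Los Angeles, 2014, as listed on the slide.
# Assumption: the values are in month order, 1 through 12.
monthly_crimes = [
    19169, 17384, 19730, 19413, 20645, 20494,
    21480, 21280, 21287, 21669, 19844, 21355,
]

by_month = dict(zip(range(1, 13), monthly_crimes))

total = sum(by_month.values())
average = total / len(by_month)
peak_month = max(by_month, key=by_month.get)
low_month = min(by_month, key=by_month.get)

print(f"Total crimes in 2014: {total}")
print(f"Average per month: {average:.1f}")
print(f"Highest month: {peak_month} ({by_month[peak_month]})")
print(f"Lowest month: {low_month} ({by_month[low_month]})")
```

If the month ordering assumption holds, the yearly total is 243,750, with the highest count in month 10 and the lowest in month 2.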
74. High Performance Information Computing Center
Mapping of Crimes That Occurred within 5 Miles of CalStateLA
75. High Performance Information Computing Center
Mapping of Crimes That Occurred within 5 Miles of UCLA
76. High Performance Information Computing Center
Mapping of Crimes That Occurred within 5 Miles of USC
77. High Performance Information Computing Center
Mapping of Crimes That Occurred within 5 Miles of CalStateLA, UCLA and USC in 2015
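One common way to select records within a fixed radius of a point is the haversine great-circle distance. The sketch below is an assumption about how such a filter could be written, not the method used in the talk (which ran Hive and Spark SQL on Azure). The campus coordinates are approximate, and the incident records are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Approximate campus location (assumption, for illustration only).
CALSTATELA = (34.0668, -118.1684)
RADIUS_MILES = 5.0

# Hypothetical incident records: (id, latitude, longitude).
incidents = [
    ("A", 34.0700, -118.1700),  # a few blocks from campus
    ("B", 34.1000, -118.1700),  # roughly 2 miles north
    ("C", 34.4208, -119.6982),  # Santa Barbara, far outside the radius
]

nearby = [
    incident_id
    for incident_id, lat, lon in incidents
    if haversine_miles(CALSTATELA[0], CALSTATELA[1], lat, lon) <= RADIUS_MILES
]

print(nearby)
```

The same predicate could be applied per campus (UCLA, USC) by swapping the center point; in a SQL engine the formula would typically be expressed inline or as a user-defined function.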
78. High Performance Information Computing Center
No. of crimes within 5 miles of CSULA, UCLA and USC, by crime type
[Bar chart: one group per campus (csula, ucla, usc); y-axis 0 to 30,000]
79. High Performance Information Computing Center
Contents
Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend
80. High Performance Information Computing Center
Future Research Trend
Deep Learning
TensorFlow and Spark
– Yahoo, Intel, Google
– Image Recognition, Prediction Analysis
ChatBot
Amazon Alexa API
IBM Watson ChatBot API
Google Home API
More into
In-Memory Processing
– Spark DataFrame, Dataset, ML
Cloud Computing
– IBM Bluemix, MS Azure, Google Cloud, Amazon AWS