This document provides an introduction to big data and artificial intelligence presented by Jongwook Woo. It discusses Woo's background and experience, provides an overview of big data including issues with traditional data handling approaches and the need for scalable solutions like Hadoop. It also covers machine learning and deep learning techniques for predictive analysis using big data, and provides examples applying these techniques to COVID-19 data and financial fraud detection.
Introduction to Big Data and its Trends | Jongwook Woo
Big Data has been popular for the last 10 years, using Hadoop and Spark for data analysis and prediction with large-scale data sets in distributed parallel computing systems. Its platform has also expanded to NoSQL databases and search engines, and it has grown more popular alongside cloud computing. Deep Learning has since become a buzzword over the past several years, using GPUs and Big Data. It lets even small companies and labs own supercomputers on a small budget, a “Dream Comes True” situation in IT and business. In this talk, the history and trends of Big Data and AI platforms are introduced and Big Data predictive analysis is presented.
Rating Prediction using Deep Learning and Spark | Jongwook Woo
Distributed Deep Learning to predict Amazon review ratings in Spark using Analytics Zoo on AWS, published as "Rating Prediction using Deep Learning and Spark" at The 11th International Conference on Internet (ICONI 2019), Hanoi, Vietnam, Dec 15 - 18, 2019.
History and Trend of Big Data and Deep Learning | Jongwook Woo
This document contains a presentation by Jongwook Woo on the history and trends of big data and deep learning. It discusses the evolution of data storage and analysis from traditional systems to modern big data platforms like Hadoop and Spark that can handle large, complex datasets in a distributed, cost-effective manner. It also covers the rise of deep learning techniques using neural networks and how they can be applied to big data at scale, such as for predictive analytics, using distributed deep learning frameworks on existing big data clusters.
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML | Jongwook Woo
This talk aims to provide insights into the performance and architecture of Financial Fraud Detection on mobile money transaction activity in Azure ML and Spark. We have predicted and classified transactions as normal or fraudulent with a small sample and a massive data set using Azure ML and Spark ML, which are traditional and Big Data systems respectively. I will present predictive analysis with several classification models tested in Azure and Spark ML. In addition, the scalability of Spark ML will be presented for the models with different numbers of nodes in Spark clusters on Amazon AWS.
The document discusses Jongwook Woo and his background working with big data. It provides details on Woo's experience as a professor focusing on big data research and education partnerships. It also outlines some of the topics Woo covers in his presentations including introductions to big data, artificial intelligence, and the relationship between AI and big data. Key technologies like Hadoop, Spark, and neural networks are mentioned.
Introduction to Big Data: Smart Factory | Jongwook Woo
Jongwook Woo presents an introduction to big data and smart factories. He discusses his background working with big data technologies and partnerships. The document then covers what big data is, common tools like Hadoop and Spark, and how big data is used in smart factories to collect, analyze and visualize machine data to improve operations. It concludes with a high-level summary of using big data for smart factory applications.
Scalable Predictive Analysis and The Trend with Big Data & AI | Jongwook Woo
This document discusses Jongwook Woo's work with Big Data AI at CalStateLA. It introduces Woo and his background, provides an overview of big data and how distributed systems enable scalable analysis of massive datasets. It also describes predictive analytics using machine learning and deep learning on big data, and how integrating GPUs into big data clusters can improve parallel processing for tasks like traffic analysis.
Predictive Analysis for Airbnb Listing Rating using Scalable Big Data Platform | Savita Yadav
KMIS International Conference 2021.
This talk aims to provide insights into the performance of predictive models for Airbnb ratings using Big Data and distributed parallel computing systems. Using two-class classification models, we predict whether a property has a high or a low rating based on the features of the listing. This helps hosts know whether their property is suitable and how their listing compares to other similar listings. We compare the results and the performance of the rating prediction models using accuracy and computing time metrics.
Traffic Data Analysis and Prediction using Big Data | Jongwook Woo
- Denser traffic on Freeways 101, 405, 10
- Rush hours from 7 am to 9 am produce a lot of traffic; the heaviest traffic starts at 3 pm and eases after 6 pm.
- Major areas of traffic in DTLA, Santa Monica, Hollywood
- More insights can be found with a bigger dataset using this framework for traffic analysis
- Such data and platforms also offer an opportunity to predict traffic congestion. Prediction can be performed using a machine learning algorithm, Decision Forest, with an accuracy of 83% for predicting the heaviest traffic jam.
My keynote talk at San Diego Superdata conference, looking at history and current state of Analytics and Data Mining, and examining the effects of Big Data
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark | Jongwook Woo
Jongwook Woo analyzed tweets about the AlphaGo vs Lee Se-Dol Go match using Hadoop and Spark on Azure HDInsight and IBM dashDB. The analysis found that the US and Japan tweeted the most about the match, with over 11,000 and 9,000 tweets respectively. Most tweets from all countries were positive in sentiment. Tweets peaked on the days when games were played, from March 9-15, 2016.
This document provides an overview of data science including:
- Definitions of data science and the motivations for its increasing importance due to factors like big data, cloud computing, and the internet of things.
- The key skills required of data scientists and an overview of the data science process.
- Descriptions of different types of databases like relational, NoSQL, and data warehouses versus data lakes.
- An introduction to machine learning, data mining, and data visualization.
- Details on courses for learning data science.
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy forming Big Data Science. It highlights the benefits of investing in it and defines a path to their evolution within most organisations.
This presentation is prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, and Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
Predictive Analytics - Big Data & Artificial Intelligence | Manish Jain
Quick overview of the latest in big data and artificial intelligence. A lot of buzzwords are being thrown around; hopefully this presentation will demystify many of the terms.
Big data refers to large volumes of diverse data that traditional database systems cannot effectively handle. With the rise of technologies like social media, sensors, and mobile devices, huge amounts of unstructured data are being generated every day. To gain insights from this "big data", alternative processing methods are needed. Hadoop is an open-source platform that can distribute data storage and processing across many servers to handle large datasets. Facebook uses Hadoop to store over 100 petabytes of user data and gain insights through analysis to improve user experience and target advertising. Organizations must prepare infrastructure like Hadoop to capture value from the growing "data tsunami" and enhance their business with big data analytics.
This document discusses analytics education in the era of big data. It begins with an overview of different terms used such as analytics, data mining, data science, and knowledge discovery. It then discusses trends in big data including the 3 V's of volume, velocity, and variety. It notes that skills and jobs in analytics are in high demand but there is a shortage of people with deep analytical skills. The document provides an overview of analytics education including various certificate programs and online courses available. It emphasizes that analytics education works best when combined with learning by doing through competitions and hands-on projects.
This document provides an introduction to data science and analytics. It discusses why data science jobs are in high demand, what skills are needed for these roles, and common types of analytics including descriptive, predictive, and prescriptive. It also covers topics like machine learning, big data, structured vs unstructured data, and examples of companies that utilize data and analytics like Amazon and Facebook. The document is intended to explain key concepts in data science and why attending a talk on this topic would be beneficial.
My presentation on Data Mining, Lessons from Competitions, and Public Data looks at the Data Mining/Data Science/Big Data evolution, reviews lessons from KDD Cup 1997, Netflix Prize, and Kaggle, presents a big list of Public and Government data APIs, Marketplaces, Portals, and Platforms, and examines Big Data Hype. This talk was given at BPDM-2013 (Broadening Participation in Data Mining), held on Aug 10, 2013 at KDD-2013, Chicago.
Data mining has evolved from simple data collection and analysis to more advanced techniques that can forecast future events, classify and cluster groups, associate related events, and sequence events over time. It involves finding patterns in data through interactive processes leveraging analysis technologies. Examples of data mining applications include fraud detection, credit scoring, failure prediction, and customer profiling to improve retention and profitability.
Presentation at Data ScienceTech Institute campuses, Paris and Nice, May 2016, including Intro, Data Science History and Terms; 10 Real-World Data Science Lessons; Data Science Now: Polls & Trends; Data Science Roles; Data Science Job Trends; and Data Science Future.
The document provides an overview of data science through an introduction by Sreejith C, a data scientist. It defines data science as discovering unknown information from data, obtaining predictive insights, creating impactful data products, and communicating business stories from data. A data scientist's work includes tasks like authoring data processing pipelines, performing analyses, and communicating results. The document also demonstrates a loan prediction problem using machine learning algorithms like logistic regression, decision trees, and random forests in Python.
This document provides an overview of a data science course. It discusses topics like big data, data science components, use cases, Hadoop, R, and machine learning. The course objectives are to understand big data challenges, implement big data solutions, learn about data science components and prospects, analyze use cases using R and Hadoop, and understand machine learning concepts. The document outlines the topics that will be covered each day of the course including big data scenarios, introduction to data science, types of data scientists, and more.
Big Data Analysis Patterns - TriHUG 6/27/2013 | boorad
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
An introduction to data science: from the very beginning of the idea of data science to the latest designs, changing trends, and technologies, through to the applications that are already in real-world use today.
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms | Jongwook Woo
This paper compares the performance of scalable predictive analysis models using XGBoost in Big Data. Performance is measured by the training computing time and by the accuracy of a model in terms of AUC and Precision. We developed XGBoost classification models with an Airbnb listing dataset that predict the recommendation of the listings. The models are built in PySpark Rapids, BigDL, and H2O Sparkling with CPU and GPU on AWS EMR. We observed that BigDL with GPU has a 25 - 50% faster training time than the other platforms. H2O Sparkling has 5 - 7% better AUC and 0.7% better Precision than the others.
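The two accuracy metrics named in this abstract can be illustrated with a small sketch. This is not the paper's code: the labels, scores, and the 0.5 decision threshold below are made-up values for illustration, and the functions are plain-Python stand-ins for what the Spark-based platforms would compute at scale.

```python
def precision(y_true, y_pred):
    """Precision = true positives / all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = recommended listing) and model scores.
Y_TRUE = [1, 0, 1, 1, 0, 0]
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
# Assumed threshold of 0.5 to turn scores into class predictions.
Y_PRED = [1 if s >= 0.5 else 0 for s in SCORES]

if __name__ == "__main__":
    print("precision:", precision(Y_TRUE, Y_PRED))
    print("auc:", auc(Y_TRUE, SCORES))
```

Precision depends on the chosen threshold, while AUC is computed from the ranking of the scores alone, which is one reason the two metrics can differ by different margins across platforms.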
This document summarizes a presentation on big data trends and open data. It introduces the speaker, Jongwook Woo, and his experience in big data. It then covers topics including what is big data, Hadoop and Spark frameworks, using open data for analysis, and examples of analyzing Twitter data on AlphaGo and government airline and crime data sets.
This document provides an overview of big data concepts including definitions of big data, sources of big data, and uses of big data analytics. It discusses technologies used for big data including Hadoop, MapReduce, Hive, Mahout, MATLAB, and Revolution R. It also addresses challenges around big data such as lack of standardization and extracting meaningful insights from large datasets.
Big Data and Advanced Data Intensive Computing | Jongwook Woo
MapReduce does not work well for real-time processing or for iterative algorithms, which are mostly used in machine learning and graph processing. These slides show Spark, Giraph, and Hadoop use cases in science rather than in business.
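To make the point about iterative algorithms concrete, here is a minimal sketch of PageRank, a typical iterative graph algorithm, written as repeated map-and-reduce style passes in plain Python. It is an illustration only, not taken from the slides: the three-page graph, the damping factor of 0.85, and the iteration count are assumptions, and the sketch assumes every page has at least one outgoing link.

```python
def pagerank_step(links, ranks, damping=0.85):
    """One full pass over the graph; every iteration repeats this work."""
    n = len(links)
    contributions = {page: 0.0 for page in links}
    # "Map": each page sends an equal share of its rank to each page it links to.
    for page, outlinks in links.items():
        for target in outlinks:
            contributions[target] += ranks[page] / len(outlinks)
    # "Reduce": combine the shares received by each page into its new rank.
    return {
        page: (1 - damping) / n + damping * contributions[page]
        for page in links
    }


def pagerank(links, iterations=20):
    """Run a fixed number of passes, feeding each pass's output into the next."""
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(links, ranks)
    return ranks


# Hypothetical three-page graph: page -> pages it links to.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

if __name__ == "__main__":
    for page, rank in sorted(pagerank(GRAPH).items()):
        print(page, round(rank, 4))
```

Each pass needs the ranks produced by the previous one. When every pass is a separate MapReduce job, that intermediate state is typically written to and re-read from distributed storage between jobs, which is the overhead that in-memory engines such as Spark are designed to avoid.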
The document is a presentation by Jongwook Woo from the High-Performance Information Computing Center (HiPIC) at California State University Los Angeles given on February 25, 2017 at the SWRC conference in San Diego, CA. It discusses big data trends with open platforms and provides information on Spark, Hadoop, open data, use cases, and the future of big data. Specifically, it summarizes Jongwook Woo's background and experience, describes what big data is and how Spark improves on Hadoop MapReduce, discusses how Spark can integrate with Hadoop ecosystems, and provides examples of analyzing local business data using Spark.
Benefiting from Semantic AI along the data life cycle | Martin Kaltenböck
Slides from the 1-hour session by Martin Kaltenböck (CFO and Managing Partner of Semantic Web Company / PoolParty Software Ltd) on 19 March 2019 in Boston, US at Enterprise Data World 2019, titled "Benefiting from Semantic AI along the data life cycle".
Big Data and Data Intensive Computing on Networks | Jongwook Woo
Big Data on Networks with Hadoop and its ecosystems (Giraph, Flume,...) at Korea Institute of Science and Technology Information. Illustrates some possible approaches on networks.
This is a talk about Big Data, focusing on its impact on all of us. It also encourages institutions to take a close look at providing courses in this area.
Architecting for Big Data: Trends, Tips, and Deployment Options | Caserta
Joe Caserta, President at Caserta Concepts, addressed the challenges of Business Intelligence in the Big Data world at the Third Annual Great Lakes BI Summit in Detroit, MI on Thursday, March 26. His talk, "Architecting for Big Data: Trends, Tips and Deployment Options," focused on how to supplement your data warehousing and business intelligence environments with big data technologies.
For more information on this presentation or the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/.
This document outlines the course content for a Big Data Analytics course. The course covers key concepts related to big data including Hadoop, MapReduce, HDFS, YARN, Pig, Hive, NoSQL databases and analytics tools. The 5 units cover introductions to big data and Hadoop, MapReduce and YARN, analyzing data with Pig and Hive, and NoSQL data management. Experiments related to big data are also listed.
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver Value to business stakeholders who are not data scientists!
This is a PowerPoint presentation on Hadoop and Big Data. It covers the essential knowledge one should have when stepping into the world of Big Data.
This course is available on hadoop-skills.com for free!
This course builds a basic, fundamental understanding of Big Data problems and of Hadoop as a solution. It takes you through:
• Understanding Big Data problems with easy-to-understand examples and illustrations.
• The history and advent of Hadoop, right from when Hadoop wasn’t even named Hadoop and was called Nutch.
• The Hadoop magic that makes it so unique and powerful.
• The difference between data science and data engineering, which is one of the big confusions when selecting a career or understanding a job role.
• And most importantly, demystifying Hadoop vendors like Cloudera, MapR, and Hortonworks by learning about them.
Big Data and Data Intensive Computing: Use Cases | Jongwook Woo
This invited talk was hosted by LG Data Mining Lab at the LG R&D center, Woomyun-dong, Seoul, Korea. It introduces the emerging Hadoop ecosystems (Giraph, Spark, Shark, Flume) and use cases of Big Data in Korea and the US, and illustrates the importance of training.
This document provides an overview of a Hadoop session that will cover:
1. An introduction to big data including the history and evolution of Hadoop and how it addresses challenges with traditional databases.
2. The Hadoop architecture and ecosystem including components like HDFS, MapReduce, HBase and how they address issues with scalability, flexibility and cost compared to traditional databases.
3. Hands-on analysis of a soccer dataset using Hadoop to perform tasks like data classification, prediction and player analysis.
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
This document provides an overview of Jongwook Woo's background and experience working with big data and Hadoop. It discusses Woo's role as a professor teaching big data courses, partnerships with Cloudera and Amazon AWS, publications on Hadoop and NoSQL databases, and certificates earned in big data training. It also summarizes key aspects of big data, including the rise of unstructured and large-scale data, issues with relational databases at scale, and the two core components of Hadoop - HDFS for storage and MapReduce for distributed processing. Finally, it provides an example MapReduce job for sorting URLs by number of hits.
Radoop is a tool that integrates Hadoop, Hive, and Mahout capabilities into RapidMiner's user-friendly interface. It allows users to perform scalable data analysis on large datasets stored in Hadoop. Radoop addresses the growing amounts of structured and unstructured data by leveraging Hadoop's distributed file system (HDFS) and MapReduce framework. Key benefits of Radoop include its scalability for large data volumes, its graphical user interface that eliminates ETL bottlenecks, and its ability to perform machine learning and analytics on Hadoop clusters.
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
This document provides an overview of Big Data and Data Intensive Computing presented by Jongwook Woo. It discusses Woo's background and experience working with Big Data. Examples of Big Data use cases in Korea are presented, including for SK Telecom, Seoul city planning, credit cards, and Hyundai Motors. Issues dealing with large-scale data in traditional RDBMS systems are outlined. Key aspects of Big Data, including MapReduce and Hadoop, are introduced.
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Tomasz Bednarz
Presented at the ACEMS workshop at QUT in February 2015.
Credits: whole project team (names listed in the first slide).
Approved by CSIRO to be shared externally.
Similar to Introduction to Big Data and AI for Business Analytics and Prediction (20)
How To Use Artificial Intelligence (AI) in HistoryJongwook Woo
The integration of Information Technology (IT) and Artificial Intelligence (AI) is revolutionizing the study of history. AI’s translation capabilities make Chinese history books accessible to a wider audience, while spatial analysis offers new insights into historical contexts. Map tools like Baidu and Google Maps simplify the process of locating historical sites. Thus, employing IT and AI is essential for modern historical research.
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeJongwook Woo
South Korea historians trained under Imperial Japan have believe that the tombs in Pyungyang belong to the Chinese Han. Dr Moon points out that the tombs have the similar remains to the northern nomadic, who might be the Hun/HyoongNo. He provides many evidence why it should not belong to the Chinese Han but the northern nomadic, who is the brother of Korean kingdoms.
Big Data Analysis in Hydrogen Station using Spark and Azure MLJongwook Woo
Decision Forest machine learning algorithm is adopted to find out the features to affect the temperature of fueling valve and controller and to predict it.
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkJongwook Woo
This document summarizes an analysis of tweets about Alphago vs Lee Se-Dol from March 12-17, 2016 using Hadoop and Spark. It finds that the US and Japan tweeted the most about the match, with most tweets being positive. The top tweeted hashtags were #Alphago. Daily tweets peaked at the times of matches and when Lee Se-Dol won game 4. The analysis also examined sentiment, gender, and monthly trends of those tweeting about the match using IBM DashDB.
Introduction to Spark: Data Analysis and Use Cases in Big Data Jongwook Woo
This document provides a summary of a presentation given by Jongwook Woo on introducing Spark for data analysis and use cases in big data. The presentation covered Spark cores, RDDs, Spark SQL, streaming and machine learning. It also described experimental results analyzing an airline data set using Spark and Hive on Microsoft Azure, including visualizations of cancelled/diverted flights by month and year and the effects of flight distance on diversions, cancellations and departure delays.
Big Data Analysis and Industrial Approach using SparkJongwook Woo
The document discusses Jongwook Woo presenting on big data analysis using Spark. It includes an introduction to himself and his experience in big data. It then covers topics like Hive examples on airline data, Spark cores and RDDs, Spark SQL, streaming and machine learning. It discusses market basket analysis examples on Spark and concludes with academic cloud computing.
- The document discusses a presentation given by Jongwook Woo on introducing Spark and its uses for big data analysis. It includes information on Woo's background and experience with big data, an overview of Spark and its components like RDDs and task scheduling, and examples of using Spark for different types of data analysis and use cases.
Introduction To Big Data and Use Cases using HadoopJongwook Woo
This document provides an introduction to big data and use cases using Hadoop presented by Jongwook Woo. It discusses Woo's background and experience working with big data technologies. It then covers emerging big data technologies, Hadoop versions 1 and 2, common use cases experienced including log analysis and customer behavior analysis, and how universities can support research and training in big data.
Introduction To Big Data and Use Cases on HadoopJongwook Woo
Jongwook Woo gave a presentation on big data and Hadoop to the Seoul Technology Society. He discussed his background working with big data technologies and his partnership with Cloudera. He then explained the core challenges of big data in terms of storing and computing large datasets. Woo described how Hadoop provides an inexpensive framework to address these challenges through its HDFS distributed file system and MapReduce programming model. He highlighted several use cases organizations have implemented on Hadoop and discussed new technologies in Hadoop 2.0 like YARN and Impala.
Introduction to Big Data and AI for Business Analytics and Prediction
1. Jongwook Woo
HiPIC
CalStateLA
Marketing Analytics Research Society
(M.A.R.S.)
Oct 7 2020
Jongwook Woo, PhD, jwoo5@calstatela.edu
Big Data AI Center (BigDAI)
California State University Los Angeles
Introduction to Big Data and AI
for Business Analytics and Prediction
2. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Big Data AI Predictive Analysis
Summary
3. Big Data Artificial Intelligence Center (BigDAI)
Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
4. Big Data Artificial Intelligence Center (BigDAI)
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
5. Big Data Artificial Intelligence Center (BigDAI)
Myself: CDH, Oracle using Hadoop Big Data
6. Big Data Artificial Intelligence Center (BigDAI)
Myself: Partners for Services
7. Big Data Artificial Intelligence Center (BigDAI)
Myself: Collaborations
SOFTZEN
8. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Big Data AI Predictive Analysis
Summary
9. Big Data Artificial Intelligence Center (BigDAI)
Data Issues
Large-Scale data
Terabyte (10^12), Petabyte (10^15)
– Because of web
– IoT (Streaming data, Sensor Data) in SmartX
– Social Computing, smart phone, online game
– Bioinformatics, …
Legacy approach
Can do
– Improve the speed of CPU
Increase the storage size
Only Problem
– Too expensive
10. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Traditional Way
11. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Traditional Way
Becomes too Expensive
12. Big Data Artificial Intelligence Center (BigDAI)
Data Issues
Large Scale Data
Too big
Non-/Semi-structured data
3 Vs, 4 Vs,…
– Velocity, Volume, Variety
Traditional Systems can handle them
– But Again, Too expensive
Cannot handle with the legacy approach
Need new systems
Non-expensive
13. Big Data Artificial Intelligence Center (BigDAI)
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on non-expensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with non-expensive computers
Own super computers
Published papers in 2003, 2004
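The MapReduce idea on this slide can be sketched in plain Python. This is only a single-process illustration, assuming a toy word-count job; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and are not the Google or Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # On a real cluster, each line (or block) would be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data AI"]))
```

In a real deployment the map and reduce calls run in parallel on inexpensive machines, and the shuffle moves data between them over the network; the sketch only shows the data flow.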
14. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Another Way
Not Expensive
From the 2017 Korean blockbuster movie “The Fortress” (남한산성)
15. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Another Way
But Works Well with the crazy massive data set
Battle of Nagashino, 1575, Japan
16. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Another Way
Not Expensive
http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677
AD 1409 (Year 9 of King Tae-Jong, Chosun Dynasty, Korea) By Choi family:
Choi Hae-san (최해산, 崔海山), son of Choi Mu-seon (최무선, 崔茂宣)
[Ref] "Joseon's Secret Weapon: Chongtonggi Hwacha" (조선의 비밀 병기 : 총통기 화차, 銃筒機火車), written by Dosim (도심)
17. Big Data Artificial Intelligence Center (BigDAI)
Big Data
Big Data (Hadoop, Spark, Distributed Deep Learning)
Cluster for Compute and Store
(Distributed File Systems: HDFS, GFS)
…
18. Big Data Artificial Intelligence Center (BigDAI)
Super Computer vs Big Data vs Cloud
Traditional Super Computer
(Parallel File Systems: Lustre, PVFS, GPFS)
Cluster for Compute
Cluster for Store
Big Data (Hadoop, Spark, Distributed Deep Learning)
Cluster for Compute and Store
(Distributed File Systems: HDFS, GFS)
However, Cloud Computing adopts the separated architecture of the traditional super computer,
with High Speed N/W (> 10 Gbps) and Object Storage
19. Big Data Artificial Intelligence Center (BigDAI)
Definition: Big Data
An inexpensive platform of distributed parallel computing systems
that can store large-scale data and process it in
parallel [1, 2]
Apache Hadoop
– Inexpensive Super Computer
– More accessible than the traditional super computers
• Anyone can own a super computer, as it is open source
– In your university labs, small companies, research centers
Other solutions with storage and computing services
– Spark
• mostly integrated into Hadoop with Hadoop community
– NoSQL DB (Cassandra, MongoDB, Redis, HBase,…)
– Elasticsearch
20. Big Data Artificial Intelligence Center (BigDAI)
What is Hadoop?
Apache Hadoop Project split from Nutch in Jan 2006
Hadoop Founder: Doug Cutting
– Apache Committer: Lucene, Nutch, …
21. Big Data Artificial Intelligence Center (BigDAI)
Big Data: Linearly Scalable
Some people question whether a system that handles a 1 ~ 3 GB
data set is Big Data
Well… on a Big Data platform, add more servers as the data grows in the future
– it is linearly scalable once built
– n times more computing power, ideally
Data Size: < 3 GB → Add n servers → Data Size: > 200 TB
22. Big Data Artificial Intelligence Center (BigDAI)
Big Data: Data Analysis & Visualization
Sentiment Map of Alphago
Positive
Negative
23. Big Data Artificial Intelligence Center (BigDAI)
K-Election 2017
(April 29 – May 9)
24. Big Data Artificial Intelligence Center (BigDAI)
Popular businesses within 5 miles of CalStateLA, USC, UCLA
25. Big Data Artificial Intelligence Center (BigDAI)
Jams and other traffic incidents reported
by users in Dec 2017 – Jan 2018:
(Dalyapraz Dauletbak)
26. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Big Data AI Predictive Analysis
Summary
27. Big Data Artificial Intelligence Center (BigDAI)
Big Data Analysis and Prediction
Big Data Analysis
Hadoop, Spark, NoSQL DB, SAP HANA, Elasticsearch, ...
Big Data for Data Analysis
– How to store, compute, analyze massive dataset?
Big Data Science
How to predict the future trend and pattern with the massive
dataset? => Machine Learning
28. Big Data Artificial Intelligence Center (BigDAI)
Spark
Parallel Computing Engine
Spark by UC Berkeley AMPLab
Started by Matei Zaharia in 2009,
– and open sourced in 2010
In-Memory storage for intermediate data
20 ~ 100 times faster than
– MapReduce
Good in Machine Learning => Big Data Science
– Iterative algorithms
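Why in-memory storage helps iterative algorithms can be sketched without Spark. The following single-process analogy assumes a hypothetical `CountingSource` that counts how often the data is re-read, and a toy gradient-descent loop that estimates the mean; none of these names come from Spark.

```python
from typing import Callable, List


class CountingSource:
    """Hypothetical data source that counts how often it is (re)read."""

    def __init__(self, values: List[float]) -> None:
        self._values = values
        self.reads = 0

    def load(self) -> List[float]:
        self.reads += 1  # stands in for an expensive disk or network read
        return list(self._values)


def iterative_mean(get_data: Callable[[], List[float]], steps: int, lr: float = 0.5) -> float:
    """Gradient descent on sum((m - x)^2) / (2n); converges to the mean of the data."""
    m = 0.0
    for _ in range(steps):
        data = get_data()  # every iteration needs the full data set
        grad = sum(m - x for x in data) / len(data)
        m -= lr * grad
    return m


source = CountingSource([1.0, 2.0, 3.0, 4.0])

# Disk-style: reload the data in every iteration.
estimate_reload = iterative_mean(source.load, steps=20)
reads_reload = source.reads

# In-memory style: load once, then reuse the cached copy.
source.reads = 0
cached = source.load()
estimate_cached = iterative_mean(lambda: cached, steps=20)
reads_cached = source.reads

if __name__ == "__main__":
    print(reads_reload, reads_cached, estimate_cached)
```

Both runs reach the same estimate, but the cached run reads the data once instead of once per iteration, which is the kind of saving iterative machine learning algorithms get from keeping intermediate data in memory.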
29. Big Data Artificial Intelligence Center (BigDAI)
Spark (Cont’d)
Spark ML
Supports Machine Learning libraries
Process massive data set to build prediction models
30. Big Data Artificial Intelligence Center (BigDAI)
Deep Learning
Machine Learning
Has been popular since Google released TensorFlow on Nov 9, 2015
Multiple Cores in GPU
– Even with multiple GPUs and CPUs
Parallel Computing in a chip
GPU (Nvidia GTX 1660 Ti)
1280 CUDA cores
Other Deep Learning Libraries
TensorFlow
PyTorch
Keras
Caffe, Caffe2
Microsoft Cognitive Toolkit (Previously CNTK)
Apache MXNet
DeepLearning4j
…
32. Big Data Artificial Intelligence Center (BigDAI)
Deep Learning
CNN
Image Recognition
Video Analysis
NLP for classification, Prediction
RNN
Time Series Prediction
Speech Recognition/Synthesis
Image/Video Captioning
Text Analysis
– Conversation Q&A
GAN
Media Generation
– Photo Realistic Images
Human Image Synthesis: Fake faces
33. Big Data Artificial Intelligence Center (BigDAI)
Data Scale Driving: Deep Learning Process
Deep Learning and Massive Data [3]
“Machine Learning Yearning” Andrew Ng 2016
34. Big Data Artificial Intelligence Center (BigDAI)
Another Gap between Deep Learning and Big Data Communities [6]
Deep learning experts
The Chasm
Big Data Engineers, Scientists, Analysts, etc.
35. Big Data Artificial Intelligence Center (BigDAI)
Leveraging Big Data Cluster
Existing Big Data cluster with massive data set, without using Big Data
Single server for Python and R: Traditional Machine Learning?
Single GPU server for Deep Learning?
Too slow in data migration, and the single server fails
Big Data Cluster
36. Big Data Artificial Intelligence Center (BigDAI)
Deep Learning with Spark
What if we combine Deep Learning and Spark?
37. Big Data Artificial Intelligence Center (BigDAI)
Leveraging Big Data Cluster
Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
Distributed Deep Learning
– Integrate Deep Learning to the cluster
No data migration is needed, and it can leverage the
parallel computing and existing large-scale data of the cluster
Big Data Cluster
38. Big Data Artificial Intelligence Center (BigDAI)
Deep Learning with Spark
Deep Learning Pipelines for Apache Spark (Databricks)
TensorFlowOnSpark (Yahoo! Inc)
BigDL: Distributed Deep Learning Library for Apache Spark (Intel)
DL4J: Deeplearning4j On Spark (Skymind)
Elephas: Distributed Deep Learning with Keras & Spark
39. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Big Data AI Predictive Analysis: Use Case
Summary
40. Big Data Artificial Intelligence Center (BigDAI)
COVID 19 Dashboard
https://www.calstatela.edu/centers/hipic/covid-19-us-ca-confirmed-prediction
41. Big Data Artificial Intelligence Center (BigDAI)
Financial Data Set
Priyanka Purushu, Jongwook Woo, "Financial Fraud Detection
adopting Distributed Deep Learning in Big Data",
KSII The 15th Asia Pacific International Conference on Information Science
and Technology (APIC-IST) 2020, July 5 -7 2020, Seoul, Korea, pp271-273,
ISSN 2093-0542
No publicly available datasets on financial services
because of the private nature of financial transactions
– especially in the mobile money transactions domain
PaySim
URL: https://www.kaggle.com/ntnu-testimon/paysim1
42. Big Data Artificial Intelligence Center (BigDAI)
Financial Data Set (Cont’d)
Size: 470 MB
6,362,620 records
Not large-scale data compared to data sets > GB
But the Big Data architecture is applicable to much bigger data sets
– As it still adopts the Spark Computing Engine in Big Data
Attributes: 11
Predictive Analysis
The target column to predict fraud :
– ‘isFraud’
43. Big Data Artificial Intelligence Center (BigDAI)
Data Understanding
Numeric attributes:
amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
Categorical attributes:
step, type, isFraud, isFlaggedFraud
String attributes:
nameOrig, nameDest
44. Big Data Artificial Intelligence Center (BigDAI)
Comparing Spark ML and DDL for fraud detection
Spark ML algorithms
DT (Decision Tree)
RF (Random Forest)
– Performance
• 53 minutes
• Best in Precision: 0.959
LR (Logistic Regression): Fastest, 24 minutes
DDL: Distributed Deep Learning in Spark
Feed Forward (FF)
– a neural network system
– Performance
• 51 minutes
• Best in Recall: 0.938
45. Big Data Artificial Intelligence Center (BigDAI)
Summary: Performance
46. Big Data Artificial Intelligence Center (BigDAI)
Summary: Accuracy and Performance
Model   Precision   Recall   Computing Time (mins)
DT      0.946       0.889    29
RF      0.959       0.909    53
LR      0.902       0.655    24
FF      0.880       0.938    51
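The trade-off in the table can be explored with a short Python sketch. The numbers are copied from the table; the F1 score (the harmonic mean of precision and recall) is an additional metric that the slides do not report, added here only as one possible way to combine the two columns.

```python
from typing import Dict, Tuple

# (precision, recall, computing time in minutes), copied from the table above
RESULTS: Dict[str, Tuple[float, float, int]] = {
    "DT": (0.946, 0.889, 29),
    "RF": (0.959, 0.909, 53),
    "LR": (0.902, 0.655, 24),
    "FF": (0.880, 0.938, 51),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Derived metric (an assumption of this sketch, not a result from the slides).
f1_by_model = {name: f1_score(p, r) for name, (p, r, _) in RESULTS.items()}

best_precision = max(RESULTS, key=lambda m: RESULTS[m][0])
best_recall = max(RESULTS, key=lambda m: RESULTS[m][1])
fastest = min(RESULTS, key=lambda m: RESULTS[m][2])

if __name__ == "__main__":
    for name, f1 in sorted(f1_by_model.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: F1 = {f1:.3f}")
    print(best_precision, best_recall, fastest)
```

Which model is preferable depends on the cost of a missed fraud (recall) versus a false alarm (precision), so a single combined score should be read with that in mind.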
47. Big Data Artificial Intelligence Center (BigDAI)
AWS Review Dataset
Monika Mishra, Mingoo Kang, Jongwook Woo, “Rating Prediction using Deep
Learning and Spark”,
The 11th International Conference on Internet (ICONI 2019), pp307-310, Dec 15-18 2019,
Hanoi, Vietnam
Predictive Analysis
Prediction of Users’ ratings
– important measures for purchase and selling
Spark ML: ALS (Alternating Least Squares) algorithm
DDL (Distributed Deep Learning): Neural Collaborative Filtering(NCF)
Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Products reviewed between 2005 and 2015 are analyzed
Total product reviews : 9.57 million
File Size : 5.26 GB
48. Big Data Artificial Intelligence Center (BigDAI)
Summary: Performance
49. Big Data Artificial Intelligence Center (BigDAI)
Summary: Mean Absolute Error
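For readers unfamiliar with the metric, Mean Absolute Error for rating prediction can be sketched as follows. The ratings below are made up for illustration and are not taken from the Amazon review data set; the function name is hypothetical.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted| over all ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up star ratings, for illustration only.
actual_ratings = [5, 3, 4, 1]
predicted_ratings = [4.5, 3.5, 4.0, 2.0]
mae = mean_absolute_error(actual_ratings, predicted_ratings)

if __name__ == "__main__":
    print(f"MAE = {mae}")
```

A lower value means the predicted ratings are, on average, closer to the ratings users actually gave.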
50. Big Data Artificial Intelligence Center (BigDAI)
What To Do?
Predictive Analysis
Big Data Analyst & Scientist
– Learn the domain of Marketing?
Marketing Experts
– Learn the cutting edge tech: machine learning, AI and Big Data technology?
Need Collaboration instead
Big Data AI
Domain Expert in Marketing
Have coffee and talk occasionally
51. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Big Data AI Predictive Analysis
Summary
52. Big Data Artificial Intelligence Center (BigDAI)
Summary
Introduction to Big Data
Spark ML for Big Data Science
Distributed Deep Learning with Spark
DDL provides better accuracy with similar computing performance by
leveraging the Big Data cluster
Collaboration and Coffee time Needed
53. Big Data Artificial Intelligence Center (BigDAI)
Questions?
54. Big Data Artificial Intelligence Center (BigDAI)
Precision vs Recall
True Positive (TP): Predicted fraud, and it is fraud
False Negative (FN): Predicted non-fraud, but it is fraud
False Positive (FP): Predicted fraud, but it is not
Precision
TP / (TP + FP)
Recall
TP / (TP + FN)
Ref: https://en.wikipedia.org/wiki/Precision_and_recall
Positive: Event occurs (Fraud)
Negative: Event does not occur (non-Fraud)
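The two formulas on this slide can be checked with a small Python sketch. The labels are made up for illustration (1 = fraud, 0 = non-fraud), and the helper names are hypothetical; returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN), where 1 means fraud and 0 means non-fraud."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, fn


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined here as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined here as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration: 1 = fraud, 0 = non-fraud.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn = confusion_counts(actual, predicted)

if __name__ == "__main__":
    print(f"precision = {precision(tp, fp):.3f}, recall = {recall(tp, fn):.3f}")
```

In fraud detection, recall measures how many actual frauds are caught, while precision measures how many fraud alerts are real.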
55. Big Data Artificial Intelligence Center (BigDAI)
References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS), Vol. 28, No. 4, December 2018, pp308-319
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, Manik Katyal, Parag Chhadva, Shubhra Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
56. Big Data Artificial Intelligence Center (BigDAI)
References
9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
12. Tensor Flow Deep Learning Open SAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-accuracy-precision-recall-and-f1-score/