Big Data and Data Intensive Computing on Networks


Big Data on networks with Hadoop and its ecosystem (Giraph, Flume, ...), presented at the Korea Institute of Science and Technology Information (KISTI). Illustrates some possible approaches on networks.

Published in: Technology

Transcript of "Big Data and Data Intensive Computing on Networks"

  1. Big Data and Data Intensive Computing on Networks
       KISTI, Dae-Jeon, Korea, Sept 23rd 2013
       Jongwook Woo (PhD)
       High-Performance Information Computing Center (HiPIC)
       Educational Partner with Cloudera and Grants Awardee of Amazon AWS
       Computer Information Systems Department
       California State University, Los Angeles (CSULA)
  2. Contents
       • Introduction
       • Emerging Big Data Technology
       • Big Data Use Cases on Networks
       • Training in Big Data
       • Big Data Supporters
       • Hadoop 2.0
  3. Me
       • Name: Jongwook Woo
       • Occupation:
           - Professor (rank: Associate Professor), California State University, Los Angeles: Capital City of Entertainment
       • Career:
           - Professor since 2002: Computer Information Systems Dept, College of Business and Economics (www.calstatela.edu/faculty/jwoo5)
           - Since 1998, consulting for many companies in Hollywood and elsewhere
               • Mainly building eBusiness applications with J2EE middleware
               • Information extraction and information integration using the FAST, Lucene/Solr, and Sphinx search engines
               • Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
           - Interested in Hadoop and Big Data since around 2009
  4. Me
       • Career (continued):
           - Advising IglooSecurity as of summer 2013:
               • Training on Hadoop and its ecosystems
               • R&D on a system for fast search over security log files generated at 30GB to 100GB per day, using Hadoop, Solr, Java, Cloudera
           - Mid-September 2013: Samsung Advanced Institute of Technology
               • Three days of training on Hadoop and its ecosystems planned
               • Introducing Cloudera material to Samsung, Korea
  5. Experience in Big Data
       • Grants
           - Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
           - Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
       • Partnership
           - Received Academic Education Partnership with Cloudera since June 2012
           - Linked with Hortonworks since May 2013: positive about providing a partnership
  6. Experience in Big Data
       • Certificate
           - Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8 2012)
           - Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24 2012)
       • Blog and GitHub for Hadoop and its ecosystems
           - http://dal-cloudcomputing.blogspot.com/ : Hadoop, AWS, Cloudera
           - https://github.com/hipic : Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
           - https://github.com/dalgual
  7. Experience in Big Data
       • Several publications regarding Hadoop and NoSQL
           - “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, in Proceedings of the International Joint Conference on Neural Networks, 2013
           - “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
           - “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
           - “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
       • Collaboration with universities and companies
           - USC, Texas A&M, Yonsei, Sookmyung, KAIST, Korean Polytech Univ
           - Cloudera, Hortonworks, VanillaBreeze, IglooSecurity
  8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  9. Data
       • Google: “We don’t have a better algorithm than others but we have more data than others”
  10. Emerging Big Data Technology
       • Giraph
       • Flume
       • Use cases experienced
  11. New Data Trend
       • Sparsity
       • Unstructured
           - Schema-free data with sparse attributes
               • Semantic or social relations
           - No relational property, nor complex join queries
               • Log data
       • Immutable
           - No need to update and delete data
  12. Data Issues
       • Large-scale data
           - Terabyte (10^12), Petabyte (10^15)
               • Because of the web
               • Sensor data, bioinformatics, social computing, smart phones, online games…
       • Cannot be handled with the legacy approach
           - Too big
           - Un-/semi-structured data
           - Too expensive
       • Need new systems
           - Inexpensive
  13. Two Cores in Big Data
       • How to store Big Data
           - NoSQL DB
       • How to compute Big Data
           - Parallel computing with multiple inexpensive computers
               • Own supercomputers
  14. Big Data Market
       • Big Data market in the world
           - $16.9 Billion in 2015 by IDC
           - $53.4 Billion in 2017 by Wikibon
       • Big Data market in Korea
           - Korea Information Society Development Institute
               • $263 Million in 2015
               • $853 Million in 2020
           - Big Data in Information Communication Technology
               • 0.6% in 2013
               • 2.3% in 2020
  15. Hadoop 1.0
       • Hadoop
           - MapReduce
           - HDFS
       • Restricted parallel programming
           - Not for iterative algorithms
           - Not for graphs
       • Illustrate it with Ch3
  16. Network Topology for Hadoop 1.0
       • Big Data Network Design Consideration by CISCO (http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html)
  17. Giraph
       • BSP
       • Facebook
       • http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
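The slide names BSP (Bulk Synchronous Parallel) as the model behind Giraph without showing it. As a rough illustration only, the sketch below simulates vertex-centric supersteps in plain Python: each vertex reads its messages, updates its value, sends messages, and a barrier separates the supersteps. It does not use the Giraph API; the function name `bsp_max_value`, the graph layout, and the choice of max-value propagation as the algorithm are all assumptions made for this example.

```python
def bsp_max_value(graph, values):
    """Propagate the maximum vertex value through a graph, BSP style.

    graph:  dict mapping each vertex to a list of neighbour vertices
            (every neighbour must also be a key of the dict).
    values: dict mapping each vertex to its initial numeric value.
    Returns (final_values, number_of_supersteps).
    """
    values = dict(values)  # do not mutate the caller's data

    # Superstep 0: every vertex sends its value to all neighbours.
    inbox = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1

    # Keep running supersteps while any message is in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                # The value changed, so tell the neighbours about it.
                values[vertex] = max(messages)
                for n in graph[vertex]:
                    outbox[n].append(values[vertex])
            # Otherwise the vertex stays silent (votes to halt).
        inbox = outbox  # barrier: messages become visible next superstep
        supersteps += 1

    return values, supersteps


# Hypothetical three-vertex chain a - b - c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, steps = bsp_max_value(graph, {"a": 3, "b": 6, "c": 1})
```

Unlike a chain of Hadoop 1.0 MapReduce jobs, the vertex state stays in memory between supersteps, which is why this model suits iterative graph algorithms.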
  18. Flume
       • Flume
           - Real-time data migration to Hadoop
           - Cloudera material
  19. Security Issues in Big Data
       • Data can be collected from social networks
           - Each piece of data means little on its own
           - Data becomes meaningful once it is collected and related
               • Hackers can use Big Data to analyze it
       • Big Data analysis can be a shield too
           - Even though it can also be used by hackers
  20. Use Cases on Networks
       • APT
       • BYOD
  21. APT
       • APT (Advanced Persistent Threat)
           - Select one target
               • Government, bank
               • By an expert group: terrorists, hackers
           - Collect and analyze data from the site
           - Use the latest hacking technology
  22. BYOD
       • BYOD (Bring Your Own Device)
           - Personal device for business
               • Efficient
               • Connects to the internal data and network
       • But not secure
           - The device can be lost
           - Exposed to open networks outside the office
           - The personal device can be hacked to break into the network
  23. Possible Solutions
       • BYOD
           - Hypervisors
               • Two OSs for a device: private and business
           - Containerization
               • Two data sets for an application: private and business
  24. Possible Solutions
       • Security Intelligence (SI)
           - Analyze IPS/IDS and security events
       • 3 steps
           - Data collection
               • Log data, event data
           - Data analysis
               • Pattern analysis, relationships among data
           - Finding solutions or fixing the problems
               • Build regulations
       • Using Big Data for SI
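The three SI steps on this slide (collect, analyze, build regulations) can be sketched on a single machine before moving them to Hadoop. The snippet below is only an illustration under stated assumptions: the log format (`timestamp source_ip event`), the function names, and the simple "too many events from one source" rule are all hypothetical and are not taken from any IPS/IDS product or from the talk.

```python
from collections import Counter


def collect(log_lines):
    """Step 1, data collection: parse raw log lines into event records.

    Assumes a hypothetical 'timestamp source_ip event' line format and
    silently skips lines that do not match it.
    """
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 3:
            yield {"time": parts[0], "src": parts[1], "event": parts[2]}


def analyze(events):
    """Step 2, data analysis: count security events per source address."""
    return Counter(event["src"] for event in events)


def build_rules(counts, threshold):
    """Step 3, regulations: list sources whose event count exceeds the threshold."""
    return sorted(src for src, n in counts.items() if n > threshold)


# Made-up sample data for illustration
sample_log = [
    "2013-09-23T10:00:01 10.0.0.5 port_scan",
    "2013-09-23T10:00:02 10.0.0.5 port_scan",
    "2013-09-23T10:00:03 10.0.0.5 login_fail",
    "2013-09-23T10:00:04 10.0.0.9 port_scan",
]
suspects = build_rules(analyze(collect(sample_log)), threshold=2)
```

On a cluster, `collect` would correspond roughly to the ingestion stage and `analyze` to a counting job; the threshold rule stands in for whatever pattern analysis is actually used.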
  25. Use Cases experienced
       • Log analysis at IglooSecurity Inc
           - Log files from IPS and IDS
               • 1.5GB per day for each system
           - Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
       • Customer behavior analysis
           - Market Basket Analysis algorithm
       • Machine learning for image processing with Texas A&M
           - Hadoop Streaming API
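Market Basket Analysis is mentioned here only by name. A common way to express its counting step in map/reduce terms is to emit every item pair of a transaction from the mapper and sum the pairs in the reducer. The sketch below simulates that in plain Python; it is a generic illustration, not the algorithm from the publications listed earlier, and the function names and sample transactions are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def map_transaction(items):
    """Map step: emit ((item_a, item_b), 1) for every unordered item pair."""
    # Sorting makes ('bread', 'milk') and ('milk', 'bread') the same key.
    for pair in combinations(sorted(set(items)), 2):
        yield pair, 1


def reduce_counts(mapped):
    """Reduce step: sum the emitted counts for each pair key."""
    totals = defaultdict(int)
    for key, value in mapped:
        totals[key] += value
    return dict(totals)


def pair_counts(transactions):
    """Run map then reduce over all transactions in a single process."""
    mapped = (kv for items in transactions for kv in map_transaction(items))
    return reduce_counts(mapped)


# Made-up transactions for illustration
baskets = [["milk", "bread"], ["bread", "milk", "eggs"], ["eggs"]]
counts = pair_counts(baskets)
```

Pairs with the highest counts are the candidates for "customers who buy X also buy Y" rules.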
  26. Use Cases in Korea
       • SK Telecom
       • Seoul
       • Credit Cards
       • Hyundai Motors
  27. SK Telecom
       • T Map
           - Collect GPS traffic data from taxis, buses, rental cars
               • Every 5 minutes, traffic data from 50,000 cars
           - Tell the quickest directions to the destination
  28. Seoul
       • Night Bus
           - Collect GPS traffic data from taxis
           - Find the most frequent traffic routes
               • Build night bus lines
  29. Credit Cards
       • Apps to find popular restaurants
           - Collect customer behavior recorded when the cards are used at restaurants
           - Based on logic: frequency of visits to the same restaurant within 3 months
           - Show the popular restaurants
       • Credit cards for gas station discounts
           - Detect a card being used at a gas station that does not provide discounts
           - Sell a new card that gives a discount at any station
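The restaurant logic on this slide ("frequency of visits to the same restaurant within 3 months") can be sketched as follows. This is a guess at one reasonable reading, not the card company's actual rule: the 90-day window, the `min_repeat` threshold, the function name, and the record layout `(card_id, restaurant, date)` are all assumptions for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def popular_restaurants(transactions, today, window_days=90, min_repeat=2):
    """Rank restaurants by how many cards revisited them within the window.

    transactions: iterable of (card_id, restaurant, date) tuples.
    A card counts as a repeat visitor if it was used at the same
    restaurant at least `min_repeat` times in the last `window_days` days.
    """
    cutoff = today - timedelta(days=window_days)

    # Visits per (card, restaurant) pair inside the time window.
    visits = Counter(
        (card, restaurant)
        for card, restaurant, day in transactions
        if cutoff <= day <= today
    )

    # Number of repeat-visiting cards per restaurant.
    repeaters = Counter(
        restaurant for (card, restaurant), n in visits.items() if n >= min_repeat
    )
    return [restaurant for restaurant, _ in repeaters.most_common()]


# Made-up card usage records for illustration
today = date(2013, 9, 23)
records = [
    ("card1", "A", date(2013, 9, 1)), ("card1", "A", date(2013, 9, 15)),
    ("card2", "A", date(2013, 8, 2)), ("card2", "A", date(2013, 8, 30)),
    ("card1", "B", date(2013, 7, 20)), ("card1", "B", date(2013, 9, 10)),
    ("card3", "B", date(2013, 9, 5)),
    ("card3", "C", date(2013, 1, 5)), ("card3", "C", date(2013, 1, 9)),
]
ranking = popular_restaurants(records, today)
```

Counting repeat visitors rather than raw transactions keeps one heavy user from dominating the ranking.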
  30. Hyundai Motors
       • Improve the present and future models
           - Collect drivers’ behavior and the status of the cars
           - Collect any errors in the car
  31. Use Cases
       • President Election
       • Amazon AWS
       • HuffPost | AOL
       • Netflix
  32. President Election
       • People behavior analysis
           - Collect people’s data on credit card usage, car models, newspapers read, Facebook, Twitter
           - For example, a pro-environmental campaign for
               • A mom who sends her kids to public school
               • A mom who tweets about organic foods
  33. HuffPost | AOL [10]
       • Two machine learning use cases
           - Comment moderation
               • Evaluate all new HuffPost user comments every day
               • Identify abusive / aggressive comments
               • Auto delete / publish ~25% of comments every day
           - Article classification
               • Tag articles for advertising, e.g. scary, salacious, …
  34. HuffPost | AOL [10]
       • Parallelize on Hadoop
           - Good news:
               • Mahout, a parallel machine learning tool, is already available.
               • There are Mallet, libsvm, Weka, … that support the necessary algorithms.
           - Bad news:
               • Mahout doesn’t support the necessary algorithms yet.
               • Other algorithms do not run natively on Hadoop.
       • Build a flexible ML platform running on Hadoop
       • Pig for the Hadoop implementation
  35. Netflix
       • Biggest video streaming company
           - Dominates the movie video industry
           - Uses Amazon AWS
       • Customer behavior analysis
           - Recommendation systems
           - Event to find the fastest customer recommendation MR algorithm
  36. Others
       • amazon.com
           - Recommend books to people
       • Google
           - Find out about influenza much earlier, by analyzing the areas affected by influenza
           - Translator, by analyzing data from many people
       • Siri of Apple
           - Natural language processing from the data of many people
  37. Training Hadoop and Ecosystems
       • Self-study
           - Are you sure you know the details?
               • Sqoop, Hive, Pig, Combiner, Partitioner, setting the number of Reducers, …
       • Training programs
           - Cloudera, Hortonworks
               • $2,500, hands-on exercises
               • About Hadoop, HBase, Hive/Pig, data analysis, data mining, etc.
           - Educational partnership with Cloudera
               • Training people at Samsung using Cloudera’s material
           - Educational partnership with Hortonworks
               • Invited to train people at the Big Data center of Gyung-gi province using Hortonworks’ material
  38. Hadoop 2.0: YARN
       • Data processing applications and services
           - Online serving: HOYA (HBase on YARN)
           - Real-time event processing: Storm, S4, other commercial platforms
           - Tez: generic framework to run a complex DAG
           - MPI: OpenMPI, MPICH2
           - Master-Worker
           - Machine learning: Spark
           - Graph processing: Giraph
           - Enabled by allowing the use of a paradigm-specific application master
       • [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  39. Big Data Supporters
       • Amazon AWS
       • Facebook
       • Twitter
       • Craigslist
  40. Amazon AWS
       • amazon.com
           - Consumer and seller business
       • aws.amazon.com
           - IT infrastructure business
               • Focus on your business, not IT management
           - Pay as you go
           - Services with many APIs
               • S3: Simple Storage Service
               • EC2: Elastic Compute Cloud
                   · Provides many virtual Linux servers
                   · Can run on multiple nodes: Hadoop and HBase, MongoDB
  41. Amazon AWS (Cont’d)
       • Customers on aws.amazon.com
           - Samsung
               • Smart TV hub sites: TV applications are on AWS
           - Netflix
               • ~25% of US internet traffic
               • ~100% on AWS
           - NASA JPL
               • Analyze more than 200,000 images
           - NASDAQ
               • Using AWS S3
       • HiPIC received research and teaching grants from AWS
  42. Facebook [7]
       • Using Apache HBase
           - For Titan and Puma
               • Message services
               • ETL
           - HBase for FB
               • Provides excellent write performance and good reads
               • Nice features: scalable, fault tolerant, MapReduce
  43. Titan: Facebook
       • Message services in FB
           - Hundreds of millions of active users
           - 15+ billion messages a month
           - 50K instant messages a second
       • Challenges
           - High write throughput
               • Every message, instant message, SMS, email
           - Massive clusters
               • Must be easily scalable
       • Solution
           - Clustered HBase
  44. Puma: Facebook
       • ETL
           - Extract, Transform, Load
               • Integrating data from many data sources into the data warehouse
           - Data analytics
               • Domain owners’ web analytics for ads and apps: clicks, likes, shares, comments, etc.
       • ETL before Puma
           - 8 - 24 hours
               • Procedures: Scribe, HDFS, Hive, MySQL
       • ETL after Puma
           - Puma
               • Real-time MapReduce framework
           - 2 - 30 secs
               • Procedures: Scribe, HDFS, Puma, HBase
  45. Twitter [8]
       • Three challenges
           - Collecting data
               • Scribe, as at FB
           - Large-scale storage and analysis
               • Cassandra: ColumnFamily key-value store
               • Hadoop
           - Rapid learning over Big Data
               • Pig
                   · 5% of the Java code
                   · 5% of the dev time
                   · Within 20% of the running time
  46. Craigslist in MongoDB [9]
       • Craigslist
           - ~700 cities, worldwide
           - ~1 billion hits/day
           - ~1.5 million posts/day
           - Servers
               • ~500 servers
               • ~100 MySQL servers
       • Migrate to MongoDB
           - Scalable, fast, proven, friendly
  47. Hadoop Streaming
       • Hadoop MapReduce for non-Java code: Python, Ruby
       • Requirements
           - A running Hadoop
           - Needs the Hadoop Streaming API
               • hadoop-streaming.jar
           - Needs Mapper and Reducer code to be built
               • Simple conversion from sequential code
           - STDIN > mapper > reducer > STDOUT
  48. Hadoop Streaming
       • MapReduce Python execution
           - http://wiki.apache.org/hadoop/HadoopStreaming
       • Syntax
           $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
           Options:
             -input <path>                  DFS input file(s) for the Map step
             -output <path>                 DFS output directory for the Reduce step
             -mapper <cmd|JavaClassName>    The streaming command to run
             -reducer <cmd|JavaClassName>   The streaming command to run
             -file <file>                   File/dir to be shipped in the Job jar file
       • Example
           $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
               -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
               -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
               -input /user/jwoo/shakespeare/* \
               -output /user/jwoo/shakespeare-output
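The example command ships `mapper.py` and `reducer.py`, but the slides do not show what is inside them. A minimal word-count sketch of what such scripts might contain is given below, assuming the usual Hadoop Streaming convention of tab-separated `key<TAB>value` lines on STDIN and STDOUT. The function names and the local `run_locally` helper are illustrative assumptions, not the author's code.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts for each word.

    The input must already be sorted by key, which is what the
    shuffle/sort phase provides between the mapper and the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate STDIN > mapper > sort > reducer > STDOUT in one process."""
    return list(reduce_lines(sorted(map_lines(lines))))


# In a real mapper.py one would loop over sys.stdin and print each
# result of map_lines; reducer.py would do the same with reduce_lines.
word_counts = run_locally(["to be or not to be"])
```

Testing the pair locally this way, before submitting the job, catches most parsing mistakes without needing a cluster.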
  49. Conclusion
       • Era of Big Data
       • Need to store and compute Big Data
           - Many solutions, but Hadoop
           - Storage: NoSQL DB
           - Computation: Hadoop MapReduce
       • Need to analyze Big Data in mobile computing, SNS for ads, user behavior, patterns …
       • Emerging technology
           - Hadoop 2.0
       • Training is important
  50. Questions?