Big Data and Data Intensive Computing: Use Cases
This invited talk was hosted by the LG Data Mining Lab at the LG R&D Center, Woomyun-dong, Seoul, Korea. It introduces emerging Hadoop ecosystem projects (Giraph, Spark, Shark, Flume) and Big Data use cases in Korea and the US, and illustrates the importance of training.

Presentation Transcript

  • Big Data and Data Intensive Computing: Use Cases
    – LG, Woo-Myon-Dong, Korea, Sept 12th 2013
    – Jongwook Woo (PhD), High-Performance Information Computing Center (HiPIC)
    – Educational Partner with Cloudera and Grants Awardee of Amazon AWS
    – Computer Information Systems Department, California State University, Los Angeles
  • Contents
    – Introduction
    – Emerging Big Data Technology
    – Big Data Use Cases
    – Training in Big Data
    – Big Data Supporters
    – Hadoop 2.0
  • Me
    – Name: Jongwook Woo (우종욱)
    – Position: Professor (rank: Associate Professor), California State University Los Angeles – the capital city of entertainment
    – Career:
      – Professor since 2002: Computer Information Systems Dept, College of Business and Economics – www.calstatela.edu/faculty/jwoo5
      – Consulting for many companies in Hollywood and elsewhere since 1998 – mainly building eBusiness applications with J2EE middleware – information extraction and integration with the FAST, Lucene/Solr, and Sphinx search engines – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
      – Interested in Hadoop and Big Data since around 2009
  • Me (continued)
    – Career (continued):
      – As of summer 2013, advising IglooSecurity: training on Hadoop and its ecosystems; R&D on a system for quickly searching security log files generated at 30 GB - 100 GB per day, using Hadoop, Solr, Java, and Cloudera
      – Mid-September 2013: Samsung Advanced Institute of Technology – a three-day training on Hadoop and its ecosystems planned – introducing Cloudera material to Samsung, Korea
  • Experience in Big Data
    – Grants
      – Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
      – Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
    – Partnership
      – Academic Education Partnership with Cloudera since June 2012
      – In contact with Hortonworks since May 2013 – positive about providing a partnership
  • Experience in Big Data
    – Certificates
      – Certificate of Achievement in the Big Data University training course “Hadoop Fundamentals I”, July 8 2012
      – Certificate of the 10gen training course “M101: MongoDB Development”, Dec 24 2012
    – Blog and GitHub for Hadoop and its ecosystems
      – http://dal-cloudcomputing.blogspot.com/ – Hadoop, AWS, Cloudera
      – https://github.com/hipic – Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
      – https://github.com/dalgual
  • Experience in Big Data
    – Several publications on Hadoop and NoSQL
      – “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, in Proceedings of the International Joint Conference on Neural Networks, 2013
      – “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
      – “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
      – “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
    – Collaboration with universities and companies
      – USC, Texas A&M, Yonsei, Sookmyung, KAIST, Korean Polytech Univ
      – Cloudera, Hortonworks, VanillaBreeze, IglooSecurity
  • What is Big Data? Map/Reduce, Hadoop, and NoSQL DB on Cloud Computing
  • Data
    – Google: “We don’t have a better algorithm than others but we have more data than others.”
  • Emerging Big Data Technology
    – Giraph
    – Spark and Shark
    – Flume
    – Use cases experienced
  • New Data Trend
    – Sparsity
    – Unstructured: schema-free data with sparse attributes – semantic or social relations – no relational properties, no complex join queries (e.g., log data)
    – Immutable: no need to update or delete data
  • Data Issues
    – Large-scale data: terabytes (10^12), petabytes (10^15) – because of the web – sensor data, bioinformatics, social computing, smartphones, online games…
    – Cannot be handled with the legacy approach: too big, un-/semi-structured data, too expensive
    – Need new, inexpensive systems
  • Two Cores in Big Data
    – How to store Big Data: NoSQL DB
    – How to compute Big Data: parallel computing with multiple inexpensive computers, instead of owning supercomputers
  • Hadoop 1.0
    – Hadoop: MapReduce and HDFS
    – Restricted parallel programming model – not for iterative algorithms – not for graph processing
  • Giraph
    – BSP (Bulk Synchronous Parallel)
    – Facebook
    – http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
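The slide only names the BSP model that Giraph implements. Purely as a hedged illustration (a single-process Python toy, not Giraph's actual Java API; the graph, damping factor, and superstep count are made up), the vertex-centric BSP idea is: each vertex runs a compute step, sends messages along its edges, and the messages only become visible after the superstep barrier.

```python
# Toy illustration of the BSP / vertex-centric model behind Giraph
# (NOT Giraph's API). Graph and constants are made up.
from collections import defaultdict

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # vertex -> out-edges
value = {v: 1.0 / len(graph) for v in graph}        # initial PageRank values
DAMPING, SUPERSTEPS = 0.85, 10

inbox = defaultdict(list)                           # messages delivered at the barrier
for superstep in range(SUPERSTEPS):
    outbox = defaultdict(list)
    for v in graph:                                 # "compute()" phase for every vertex
        if superstep > 0:
            value[v] = (1 - DAMPING) / len(graph) + DAMPING * sum(inbox[v])
        share = value[v] / len(graph[v])            # send my rank along out-edges
        for target in graph[v]:
            outbox[target].append(share)
    inbox = outbox                                  # barrier: messages become visible
                                                    # only in the next superstep
print({v: round(rank, 3) for v, rank in value.items()})
```

Iterative graph algorithms like this are exactly the workloads the previous slide says Hadoop 1.0's MapReduce handles poorly.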
  • Spark and Shark
    – High-speed in-memory analytics over Hadoop and Hive data
    – Fast data sharing – iterative graph algorithms – interactive queries
    – http://www.slideshare.net/Hadoop_Summit/spark-and-shark
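The slide attributes Spark's speed to in-memory data sharing. A minimal hedged PySpark sketch (the file name, the "ERROR" filter, and the local master are assumptions, not from the talk): caching an RDD lets repeated queries reuse the in-memory data instead of re-reading it, which is what makes iterative and interactive workloads fast.

```python
# Minimal PySpark sketch; file path and "ERROR" filter are hypothetical.
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

logs = sc.textFile("access.log").cache()   # keep the data in memory after first read

# The first action reads the file; later actions reuse the cached copy.
print("total lines :", logs.count())
print("error lines :", logs.filter(lambda line: "ERROR" in line).count())

top_words = (logs.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b)
                 .takeOrdered(10, key=lambda kv: -kv[1]))
print(top_words)
sc.stop()
```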
  • Flume
    – Real-time data migration into Hadoop
    – Cloudera material
  • Use Cases Experienced
    – Log analysis at IglooSecurity Inc.
      – Log files from IPS and IDS – 1.5 GB per day for each system
      – Extracting unusual cases using Hadoop, Solr, and Flume on Cloudera
    – Customer behavior analysis: Market Basket Analysis algorithm
    – Machine learning for image processing with Texas A&M: Hadoop Streaming API
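As a hedged illustration of the Market Basket Analysis idea mentioned above (a simple pair co-occurrence count, not necessarily the exact algorithm of the cited Apriori-Map/Reduce papers, and the one-comma-separated-basket-per-line input format is assumed), a Hadoop Streaming mapper can emit every item pair in a basket so a summing reducer counts how often the pair is bought together.

```python
#!/usr/bin/env python
# Sketch of a Market Basket Analysis mapper for Hadoop Streaming:
# emits "itemA,itemB<TAB>1" for every pair in a basket. Input format
# (one comma-separated basket per line) is an assumption.
import sys
from itertools import combinations

for line in sys.stdin:
    basket = sorted({item.strip() for item in line.split(",") if item.strip()})
    for a, b in combinations(basket, 2):
        print(f"{a},{b}\t1")
```

Any reducer that sums values per key (such as the word-count reducer sketched later in the Hadoop Streaming section) completes the pair-frequency count.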
  • Use Cases in Korea
    – SK Telecom
    – Seoul
    – Credit cards
    – Hyundai Motors
  • SK Telecom
    – T Map: collects GPS traffic data from taxis, buses, and rental cars – traffic data from 50,000 cars every 5 minutes
    – Tells the quickest directions to the destination
  • Seoul
    – Night Bus: collects GPS traffic data from taxis
    – Finds the most frequent routes to build bus lines for the night
  • Credit Cards
    – Apps to find popular restaurants: collect customer behavior from card usage at restaurants; logic based on how frequently customers revisit the same restaurants within 3 months; show the popular restaurants
    – Credit cards for gas-station discounts: spot customers using a card at gas stations that do not provide discounts, then sell them a new card that gives a discount at any station
  • Hyundai Motors
    – Improve the present and future models
    – Collect drivers’ behavior and the status of the cars
    – Collect any errors in the cars
  • Use Cases
    – Presidential election
    – Amazon AWS
    – HuffPost | AOL
    – Netflix
  • Presidential Election
    – People behavior analysis: collect people’s data on credit card usage, car models, newspapers read, Facebook, Twitter
    – For example, a pro-environmental campaign targeted at a mom who sends her kids to public school and who tweets about organic foods
  • HuffPost | AOL [10]
    – Two machine learning use cases
    – Comment moderation: evaluate all new HuffPost user comments every day – identify abusive/aggressive comments – auto delete/publish ~25% of comments every day
    – Article classification: tag articles for advertising (e.g., scary, salacious, …)
  • HuffPost | AOL [10]
    – Parallelize on Hadoop
    – Good news: Mahout, a parallel machine learning tool, is already available; Mallet, libsvm, Weka, … support the necessary algorithms
    – Bad news: Mahout doesn’t support the necessary algorithms yet; the other tools do not run natively on Hadoop
    – So they built a flexible ML platform running on Hadoop, with Pig for the Hadoop implementation
  • Netflix
    – Biggest video streaming company – dominates the movie/video industry
    – Uses Amazon AWS
    – Customer behavior analysis: recommendation systems
    – Held an event to find the fastest customer-recommendation MapReduce algorithm
  • Others
    – amazon.com: recommends books to people
    – Google: detects influenza outbreaks much earlier by analyzing data from the affected areas; builds its translator by analyzing data from many people
    – Apple’s Siri: natural language processing built from many people’s data
  • Training in Hadoop and Its Ecosystems
    – Self-study: are you sure you know the details? – Sqoop, Hive, Pig, Combiner, Partitioner, setting the number of Reducers, …
    – Training programs: Cloudera, Hortonworks – $2,500, hands-on exercises – covering Hadoop, HBase, Hive/Pig, data analysis, data mining, etc.
    – Educational partnership with Cloudera – training people at Samsung using Cloudera’s material
    – Educational partnership with Hortonworks – invited to train people at the Big Data center of Gyeonggi province using Hortonworks’ material
  • Hadoop 2.0: YARN
    – Data processing applications and services
    – Online serving: HOYA (HBase on YARN)
    – Real-time event processing: Storm, S4, other commercial platforms
    – Tez: a generic framework to run a complex DAG
    – MPI: OpenMPI, MPICH2
    – Master-Worker
    – Machine learning: Spark
    – Graph processing: Giraph
    – Enabled by allowing paradigm-specific application masters
    – [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  • Big Data Supporters
    – Amazon AWS
    – Facebook
    – Twitter
    – Craigslist
  • Amazon AWS
    – amazon.com: consumer and seller business
    – aws.amazon.com: IT infrastructure business – focus on your business, not IT management – pay as you go
    – Services with many APIs
      – S3: Simple Storage Service
      – EC2: Elastic Compute Cloud – provides many virtual Linux servers – can run on multiple nodes – Hadoop and HBase, MongoDB
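The talk itself shows no AWS code. As a minimal hedged sketch of the S3 and EC2 APIs using the boto3 SDK (the bucket name, key, and local file are hypothetical):

```python
# Minimal boto3 sketch; bucket, key, and file names are hypothetical.
import boto3

# S3: store and fetch an object, paying only for what you use.
s3 = boto3.client("s3")
s3.upload_file("access.log", "hipic-demo-bucket", "logs/access.log")
obj = s3.get_object(Bucket="hipic-demo-bucket", Key="logs/access.log")
print(obj["ContentLength"], "bytes stored")

# EC2: list running virtual Linux servers, e.g. the nodes of a Hadoop/HBase cluster.
ec2 = boto3.client("ec2")
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]
for reservation in reservations:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"])
```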
  • Amazon AWS (Cont’d)
    – Customers on aws.amazon.com
      – Samsung: Smart TV hub sites – TV applications are on AWS
      – Netflix: ~25% of US internet traffic – ~100% on AWS
      – NASA JPL: analyzes more than 200,000 images
      – NASDAQ: uses AWS S3
    – HiPIC received research and teaching grants from AWS
  • Facebook [7]
    – Using Apache HBase for Titan and Puma – message services – ETL
    – HBase for FB: provides excellent write performance and good reads
    – Nice features: scalable, fault tolerant, MapReduce
  • Titan: Facebook
    – Message services in FB: hundreds of millions of active users – 15+ billion messages a month – 50K instant messages a second
    – Challenges: high write throughput (every message, instant message, SMS, email) – massive clusters that must be easily scalable
    – Solution: clustered HBase
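The transcript doesn't show how messages are written to HBase. Purely as a hedged illustration of a write-heavy HBase workload from Python, here is a sketch using the happybase Thrift client; the table name, column family, and row-key layout are assumptions, not Facebook's actual Titan schema.

```python
# Hedged illustration only: a hypothetical message store in HBase via the
# happybase Thrift client. Table, column family, and key layout are assumed.
import time
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift server
table = connection.table("messages")

def store_message(user_id, sender, body):
    # Row key: user id + reversed timestamp, so a user's newest messages sort first.
    reversed_ts = 2**63 - int(time.time() * 1000)
    row_key = f"{user_id}:{reversed_ts:020d}".encode()
    table.put(row_key, {b"m:sender": sender.encode(), b"m:body": body.encode()})

def latest_messages(user_id, limit=10):
    # Prefix scan returns the most recent rows for this user.
    return list(table.scan(row_prefix=f"{user_id}:".encode(), limit=limit))

store_message("user42", "alice", "hello from HBase")
for key, data in latest_messages("user42"):
    print(key, data)
```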
  • Puma: Facebook
    – ETL: Extract, Transform, Load – integrating data from many sources into the data warehouse
    – Data analytics: domain owners’ web analytics for ads and apps – clicks, likes, shares, comments, etc.
    – ETL before Puma: 8 - 24 hours – pipeline: Scribe, HDFS, Hive, MySQL
    – ETL after Puma: 2 - 30 seconds – Puma is a real-time MapReduce framework – pipeline: Scribe, HDFS, Puma, HBase
  • Twitter [8]
    – Three challenges
    – Collecting data: Scribe, as at FB
    – Large-scale storage and analysis: Cassandra (a column-family key-value store) and Hadoop
    – Rapid learning over Big Data: Pig – 5% of the Java code, 5% of the dev time, within 20% of the running time
  • Craigslist in MongoDB [9]
    – Craigslist: ~700 cities worldwide – ~1 billion hits/day – ~1.5 million posts/day
    – Servers: ~500 servers – ~100 MySQL servers
    – Migrating to MongoDB: scalable, fast, proven, friendly
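A hedged sketch of the schema-free storage MongoDB offers, using PyMongo; the database, collection, and fields are hypothetical and not Craigslist's real data model.

```python
# Hedged PyMongo sketch; database, collection, and field names are hypothetical.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["classifieds_demo"]["posts"]

# Schema-free insert: each post is just a document.
posts.insert_one({
    "city": "los_angeles",
    "category": "furniture",
    "title": "Blue couch",
    "price": 150,
    "posted_at": datetime.now(timezone.utc),
})

# Index and query like any other collection.
posts.create_index([("city", 1), ("category", 1)])
for doc in posts.find({"city": "los_angeles", "price": {"$lte": 200}}).limit(5):
    print(doc["title"], doc["price"])
```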
  • Hadoop Streaming
    – Hadoop MapReduce for non-Java code: Python, Ruby
    – Requirements
      – A running Hadoop cluster
      – The Hadoop Streaming API – hadoop-streaming.jar
      – Mapper and reducer programs – a simple conversion from sequential code
      – STDIN > mapper > reducer > STDOUT
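The slide lists the requirements but not the scripts themselves. A minimal word-count mapper/reducer pair, in the spirit of the mapper.py and reducer.py shipped with -file in the example on the next slide (a sketch, not the code from the talk):

```python
#!/usr/bin/env python
# mapper.py -- reads raw text from STDIN and emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop delivers mapper output sorted by key, so consecutive
# lines with the same word can be summed into a single count.
import sys

current_word, count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Outside Hadoop, the pair can be smoke-tested with `cat input.txt | ./mapper.py | sort | ./reducer.py`, which mirrors the STDIN > mapper > reducer > STDOUT flow above.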
  • Hadoop Streaming
    – MapReduce Python execution: http://wiki.apache.org/hadoop/HadoopStreaming
    – Syntax:
      $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
      Options:
        -input <path>                DFS input file(s) for the Map step
        -output <path>               DFS output directory for the Reduce step
        -mapper <cmd|JavaClassName>  the streaming command to run
        -reducer <cmd|JavaClassName> the streaming command to run
        -file <file>                 file/dir to be shipped in the job jar file
    – Example:
      $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
          -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
          -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
          -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
  • Conclusion
    – The era of Big Data: we need to store and compute Big Data
    – Many solutions, but mainly Hadoop – storage: NoSQL DB – computation: Hadoop MapReduce
    – Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
    – Emerging technology: Hadoop 2.0
    – Training is important
  • Questions?