To subscribe to the bd-ucla mailing list for seminar announcements, please visit this page
BD-UCLA (Big Data - UCLA, formerly DB-UCLA) Seminar: Current Schedule
Time: 12:00pm-1:00pm Fridays; Room: 3551P Boelter Hall
*To invite a guest speaker or to schedule a talk, contact Mohan Yang (yang at cs dot ucla dot edu)
|10/25||Mohan Yang||In Memory Evaluation of Recursive Queries on Multi-core Machines|
|12/06||Prof. Lixia Zhang||Evolving Internet into the Future via Named Data Networking|
|04/05||Sung Jin Kim||Linear Trend Analysis and Estimation of Optimizer Statistics in Teradata 14|
|04/19||Young Cha||Incorporating Popularity in Topic Models for Social Network Analysis|
|01/18||Kai Zeng||Incremental Learning for Big Data|
|02/01||Hamid Mousavi||SemScape: The NLP-Based Text Mining Framework|
|02/22||Prof. Tyson Condie||Big Learning Systems|
Speaker:Prof. Lixia Zhang
Abstract:The success of TCP/IP protocol architecture has brought us an explosive growth of Internet applications. Since applications operate in terms of data and more end points become mobile, however, it becomes increasingly difficult and inefficient to satisfy IP's requirement of determining exactly where (at which IP address) to find desired data. The Named Data Networking project (NDN) aims to carry the Internet into the future through a conceptually simple yet transformational architecture shift, from today's focus on where -- addresses and hosts -- to what -- the data that users and applications care about. By naming data instead of their locations, NDN transforms data into first-class entities, enabling direct security of data instead of data containers as well as radically scalable communication mechanisms such as multicast delivery and in-network storage.
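The "what, not where" shift described above can be pictured with a toy sketch. This is a hypothetical illustration only, not real NDN code or its API: data is published and retrieved purely by name, so any node's in-network store can satisfy a request, regardless of which host produced the data.

```python
# Toy illustration (not actual NDN software) of naming data instead of locations:
# consumers ask for data by name; any node holding it can answer.
class ContentStore:
    def __init__(self):
        self.store = {}          # data name -> data

    def publish(self, name, data):
        self.store[name] = data

    def interest(self, name):
        # satisfied from in-network storage, with no notion of "which host"
        return self.store.get(name)

cs = ContentStore()
cs.publish("/ucla/cs/seminar/schedule", "Fridays 12pm, 3551P Boelter")
print(cs.interest("/ucla/cs/seminar/schedule"))
```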
Speaker:Mohan Yang
Abstract:Datalog is a powerful language that can express recursive queries such as transitive closure, shortest path, and bill of materials. In this work, we study the in-memory evaluation of these recursive queries on current multi-core machines. We describe different evaluation algorithms and our parallel implementation of these algorithms, and we report their serial and parallel execution performance on our test data.
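One evaluation strategy for such queries can be illustrated with a small sketch of semi-naive evaluation, a standard fixpoint algorithm, applied to transitive closure. The graph and function names below are illustrative assumptions, not code from the talk.

```python
# Semi-naive evaluation of transitive closure: each round joins only the
# facts derived in the previous round against the base edge relation.
def transitive_closure(edges):
    tc = set(edges)          # all derived path facts so far
    delta = set(edges)       # facts derived in the previous round
    by_src = {}              # index edges by source for the join
    for (u, v) in edges:
        by_src.setdefault(u, set()).add(v)
    while delta:
        new = set()
        for (x, y) in delta:              # path(x, y) ...
            for z in by_src.get(y, ()):   # ... joined with edge(y, z)
                if (x, z) not in tc:
                    new.add((x, z))       # derives path(x, z)
        tc |= new
        delta = new          # next round uses only the new facts
    return tc

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# all reachable pairs: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
```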
Speaker:Young Cha
Abstract:In this paper, we propose topic models for social network data. Our topic models specialize in handling the "popularity bias" caused by the dominance of a limited number of popular users (or nodes) in a dataset. Such popular nodes have traditionally been removed from topic models, much as stop words (e.g., "the" and "is") are removed from text because they carry little meaning. In a social network dataset, however, most people are interested in popular users (e.g., Barack Obama and Britney Spears), so these nodes should be handled carefully. To solve this problem, we introduce the notion of a "popularity component" and explore various ways to effectively incorporate it. Through extensive experiments, we show that our proposed models achieve significant improvements over existing models in terms of lowering "perplexity". We also show that the outgoing edge degree (how many people a user follows) does not help much in achieving lower perplexity. Our models can be useful in providing more accurate recommendations and clusterings for various services, including social network services.
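Perplexity, the comparison metric mentioned above, is the exponential of the negative average log-likelihood that a model assigns to held-out observations; lower is better. A minimal sketch, with made-up per-observation log-likelihoods:

```python
import math

def perplexity(log_probs):
    """exp of the negative mean log-likelihood over held-out observations."""
    return math.exp(-sum(log_probs) / len(log_probs))

# a hypothetical model assigning probability 0.25 to each of 4 held-out items
print(round(perplexity([math.log(0.25)] * 4), 6))   # → 4.0
```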
Speaker:Sung Jin Kim
Abstract:Optimizer statistics play a crucial role in helping database query optimizers find optimal execution plans for given queries. DBAs recollect statistics periodically to keep them up to date, since stale statistics can mislead query optimizers into choosing non-optimal plans. In a large database system, however, refreshing all stale statistics is often prohibitively expensive and cannot be completed within a limited amount of time. Teradata 14 introduced a novel approach that estimates up-to-date statistics through a linear trend analysis of the historical changes in statistics, allowing the Query Optimizer to use up-to-date statistics when searching for optimal execution plans. The approach was evaluated on synthesized data sets and five-year customer data sets; the experiments showed that the new approach is highly accurate and that the Teradata Query Optimizer can find optimal execution plans without collecting statistics as often.
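As a rough illustration of the idea (not Teradata's actual implementation), a statistic such as a table's row count can be extrapolated to the present by fitting a line to its historical collections. The timestamps and row counts below are made-up example data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# historical statistics collections: (days since first collection, row count)
days = [0, 30, 60, 90]
rows = [1000, 1100, 1200, 1300]
a, b = fit_line(days, rows)
# extrapolate the row count to "today" (day 120) without recollecting
print(round(a + b * 120))   # → 1400
```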
Speaker:Prof. Tyson Condie
Abstract:A new wave of systems is emerging in the space of Big Data analytics that opens the door to programming models beyond Hadoop MapReduce (HMR). It is well understood that HMR is not ideal for applications in the domain of machine learning and graph processing. This realization is fueling a new series of Big Data systems: Berkeley Spark, Google Pregel, GraphLab (CMU), and Hyracks (UC Irvine), to name a few. Each of these adds unique capabilities, but they form islands around key functionalities: fault tolerance, resource allocation, and data caching. In this talk, I will provide an overview of Big Data systems, starting with Google's MapReduce, which defined the foundational architecture for processing large data sets. I will then identify a key limitation of this architecture: its inability to efficiently support iterative workflows. I will then describe real-world examples of systems that aim to fill this computational void and argue that all of these designs are flawed in some regard. I will conclude with a description of my own work on building a Big Data application server that unifies the key runtime functionalities (fault tolerance, resource allocation, data caching, and more) for workflows, both iterative and acyclic, that process large data sets.
Speaker:Hamid Mousavi
Abstract:SemScape (short for SEMantic SCAPE) is an NLP-based framework for mining unstructured or free text. SemScape arose from a collaboration between the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) and the Computer Science Department (CSD) at the University of California, Los Angeles. The ultimate goal of SemScape is to convert text into a machine-friendly structure, called a TextGraph, which contains the grammatical relations between terms and words in the text. From a machine's point of view, TextGraphs contain more semantic information than a sequence of keywords. To build them, SemScape uses linguistic morphologies to extract concepts and relations simultaneously; these morphologies are captured through several manually generated patterns over parse trees produced by parsers such as Charniak or Stanford. In this talk, I will briefly introduce the main ideas behind SemScape, talk about its unique features, discuss current challenges, and go over some applications of SemScape we are working on at the moment. You may also find more information about our projects on the following website: http://qonosnlp.cse.ucla.edu/mapper/index.html
Speaker:Kai Zeng
Abstract:The 'data deluge' we are experiencing presents data scientists with opportunities to collect petabyte-sized datasets on which to train machine learning models for various tasks. However, blindly applying machine learning to datasets at larger-than-necessary scale wastes computing resources and takes excessive time. The ability to produce fast approximate models from a small sample of the training data is therefore essential to large-scale data processing platforms such as MapReduce. Unfortunately, existing platforms lack sampling and incremental-learning functionality and are thus not well suited to such fast approximate ML tasks. We propose a new data analysis platform that employs bootstrap techniques and loop-aware/resource-aware scheduling for incremental and intelligent learning on massive-scale datasets.
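The bootstrap idea behind fast approximate learning can be sketched in miniature: estimate a quantity on resamples of a small sample and report its spread. The statistic here (the mean) and the data stand in for a real ML model and training set; this is an illustration of the general technique, not the proposed platform.

```python
import random

def bootstrap(sample, stat, reps=1000, seed=0):
    """Point estimate of stat(sample) plus a 95% percentile interval
    computed from `reps` resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(reps))
    return stat(sample), (estimates[int(0.025 * reps)],
                          estimates[int(0.975 * reps)])

mean = lambda xs: sum(xs) / len(xs)
est, (lo, hi) = bootstrap([4, 8, 15, 16, 23, 42], mean)
print(est, lo, hi)   # estimate with its error bars from a tiny sample
```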